Morgan Kaufmann Publishers is an imprint of Elsevier.
30 Corporate Drive, Suite 400,
Burlington, MA 01803, USA
This book is printed on acid-free paper.
Copyright © 2009 by Jerome H. Saltzer and M. Frans Kaashoek. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. All trademarks that appear or are otherwise referred to in this work belong to their respective owners. Morgan Kaufmann Publishers does not have any relationship or affiliation with such trademark owners nor do such trademark owners confirm, endorse or approve the contents of this work. Readers, however, should contact the appropriate companies for more information regarding trademarks and any related registrations.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.com. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”
Suggestions, comments, and corrections: Please send correspondence by e-mail to Saltzer@mit.edu and kaashoek@mit.edu
Library of Congress Cataloging-in-Publication Data
Application submitted
ISBN: 978-0-12-374957-4
For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com
Printed in the United States of America
09 10 11 12 13 10 9 8 7 6 5 4 3 2 1
Typeset by: diacriTech, Chennai, India
Sidebar 1.1: Stopping a Supertanker
Sidebar 1.2: Why Airplanes can’t Fly
Sidebar 1.3: Terminology: Words Used to Describe System Composition
Sidebar 1.4: The Cast of Characters and Organizations
Sidebar 1.5: How Modularity Reshaped the Computer Industry
Sidebar 1.6: Why Computer Technology has Improved Exponentially with Time
Sidebar 2.1: Terminology: Durability, Stability, and Persistence
Sidebar 2.2: How Magnetic Disks Work
Sidebar 2.3: Representation: Pseudocode and Messages
Sidebar 2.4: What is an Operating System?
Sidebar 2.5: Human Engineering and the Principle of Least Astonishment
Sidebar 3.1: Generating a Unique Name from a Timestamp
Sidebar 3.2: Hypertext Links in the Shakespeare Electronic Archive
Sidebar 4.1: Enforcing Modularity with a High-Level Language
Sidebar 4.2: Representation: Timing Diagrams
Sidebar 4.3: Representation: Big-Endian or Little-Endian?
Sidebar 4.4: The X Window System
Sidebar 4.5: Peer-to-peer: Computing without Trusted Intermediaries
Sidebar 5.1: RSM, Test-and-Set, and Avoiding Locks
Sidebar 5.2: Constructing a Before-or-After Action without Special Instructions
Sidebar 5.3: Bootstrapping an Operating System
Sidebar 5.4: Process, Thread, and Address Space
Sidebar 5.5: Position-Independent Programs
Sidebar 5.6: Interrupts, Exceptions, Faults, Traps, and Signals
Sidebar 5.7: Avoiding the Lost Notification Problem with Semaphores
Sidebar 6.1: Design Hint: When in Doubt Use Brute Force
Sidebar 6.2: Design Hint: Optimize for the Common Case
Sidebar 6.3: Design Hint: Instead of Reducing Latency, Hide It
Sidebar 6.4: RAM Latency
Sidebar 6.5: Design Hint: Separate Mechanism from Policy
Sidebar 6.6: OPT is a Stack Algorithm and Optimal
Sidebar 6.7: Receive Livelock
Sidebar 6.8: Priority Inversion
Sidebar 7.1: Error Detection, Checksums, and Witnesses
Sidebar 7.3: Framing Phase-Encoded Bits
Sidebar 7.4: Shannon’s Capacity Theorem
Sidebar 7.5: Other End-to-End Transport Protocol Interfaces
Sidebar 7.6: Exponentially Weighted Moving Averages
Sidebar 7.7: What does an Acknowledgment Really Mean?
Sidebar 8.1: Reliability Functions
Sidebar 8.2: Risks of Manipulating MTTFs
Sidebar 9.1: Actions and Transactions
Sidebar 9.2: Events that Might Lead to Invoking an Exception Handler
Sidebar 11.2: Should Designs and Vulnerabilities be Public?
Sidebar 11.3: Malware: Viruses, Worms, Trojan Horses, Logic Bombs, Bots, Drive-by Downloads, etc.
Sidebar 11.4: Why are Buffer Overrun Bugs so Common?
Sidebar 11.5: Authenticating Personal Devices: the Resurrecting Duckling Policy
Sidebar 11.6: The Kerberos Authentication System
To the best of our knowledge this textbook is unique in its scope and approach. It provides a broad and in-depth introduction to the main principles and abstractions for engineering computer systems, be it an operating system, a client/service application, a database system, a secure Web site, or a fault-tolerant disk cluster. These principles and abstractions are timeless and are of value to any student or professional reader, whether or not specializing in computer systems. The principles and abstractions derive from insights that have proven to work over generations of computer systems, the authors’ own experience with building computer systems, and teaching about them for several decades.
The book teaches a broad set of principles and abstractions, yet it explores them in depth. It captures the core of a concept using pseudocode so that readers can test their understanding of a concrete instance of the concept. Using pseudocode, the book carefully documents the essence of client/service computing, remote procedure calls, files, threads, address spaces, best-effort networks, atomicity, authenticated messages, and so on. This approach continues in the problem sets, where readers can explore the design of a wide range of systems by studying their pseudocode.
This printed textbook is Part I of a two-part publication, containing just the first six chapters. Part II, consisting of Chapters 7–11 and additional supporting materials, is posted on-line as an open educational resource. For details of how and where to find Part II on-line, see “Where to find Part II and other on-line materials” on page xxix.
Many fundamental ideas concerning computer systems, such as design principles, modularity, naming, abstraction, concurrency, communications, fault tolerance, and atomicity, are common to several of the upper-division electives of the Computer Science and Engineering (CSE) curriculum. A typical CSE curriculum starts with two beginning courses, one on programming and one on hardware. It then branches out, with one of the main branches consisting of systems-oriented electives that carry labels such as
The primary problem with this list is that it has grown over the last three decades, and most students interested in systems do not have the time to take all or even several of those courses. The typical response is for the CSE curriculum to require either “choose three” or “take Operating Systems plus two more”. The result is that most students end up with no background at all in the remaining topics. In addition, none of the electives can assume that any of the other electives have preceded it, so common material ends up being repeated several times. Finally, students who are not planning to specialize in systems but want to have some background have little choice but to go into depth in one or two specialized areas.
This book cuts across all of these courses, identifying common mechanisms and design principles, and explaining in depth a carefully chosen set of cross-cutting ideas. This approach provides an opportunity to teach a core undergraduate course that is accessible to all Computer Science and Engineering students, whether or not they intend to specialize in systems. On the one hand, students who will just be users of systems will take away a solid grounding, while on the other hand those who plan to make a career out of designing systems can learn more advanced material more effectively through electives that have the same names as in the list above but with more depth and less duplication. Both groups will acquire a broad base of what the authors hope are timeless concepts rather than current and possibly short-lived techniques. We have found this course structure to be effective at M.I.T.
The book achieves its extensive range of coverage without sacrificing intellectual depth by focusing on underlying and timeless concepts that will serve the student over an entire professional career, rather than providing detailed expositions of the mechanics of operation of current systems that will soon become obsolete. A pervading philosophy of the book is that pedagogy takes precedence over job training. For example, the text does not teach a particular operating system or rely on a single computer architecture. Instead it introduces models that exhibit the main ideas found in contemporary systems, but in forms less cluttered with evolutionary vestiges. The pedagogical model is that for someone who understands the concepts, the detailed mechanics of operation of any particular system can easily and quickly be acquired from other books or from the documentation of the system itself. At the same time, the text makes concepts concrete using pseudocode fragments, so that students have something specific to examine and to test their understanding of the concepts.
The authors intend the book for students and professionals who will
Supervise the design of computer systems.
Engineer applications of computer systems to information management.
Direct the integration of computer systems within an organization.
Evaluate performance of computer systems.
Keep computer systems technologically up to date.
Go on to study individual topics such as networks, security, or transaction management in greater depth.
Work in other areas of computer science and engineering, but would like to have a basic understanding of the main ideas about computer systems.
Level: This book provides an introduction to computer systems. It does not attempt to explore every issue or get to the bottom of those issues it does explore. Instead, its goal is for the reader to acquire insight into the complexities of the systems he or she will be depending on for the remainder of a career as well as the concepts needed to interact with system designers. It provides a solid foundation about the mechanisms that underlie operating systems, database systems, data networks, computer security, distributed systems, fault-tolerant computing, and concurrency. By the end of the book, the reader should in principle be able to follow the detailed engineering of many aspects of computer systems, be prepared to read and understand current professional literature about systems, and know what questions to ask and where to find the answers.
The book can be used in several ways. It can be the basis for a one-semester, two-quarter, or three-quarter series on computer systems. Or one or two selected chapters can be the introduction to a traditional undergraduate elective or a graduate course in operating systems, networks, database systems, distributed systems, security, fault tolerance, or concurrency. Used in this way, a single book can serve a student several times. Another possibility is that the text can be the basis for a graduate course in systems in which students review those areas they learned as undergraduates and fill in the areas they missed.
Prerequisites: The book carefully limits its prerequisites. When used as a textbook, it is intended for juniors and seniors who have taken introductory courses on software design and on computer hardware organization, but it does not require any more advanced computer science or engineering background. It defines new terms as it goes, and it avoids jargon, but nevertheless it also assumes that the reader has acquired some practical experience with computer systems from a summer job or two or from laboratory work in the prerequisite courses. It does not require that the reader be fluent in any particular computer language, but rather be able to transfer general knowledge about computer programming languages to the varied and sometimes ad hoc programming language used in pseudocode examples.
Other Readers: Professionals should also find this book useful. It provides a modern and forward-looking perspective on computer system design, based on enforcing modularity. This perspective recognizes that over the last decade or two, the primary design challenge has become that of keeping complexity under control rather than fighting resource constraints. In addition, professionals who in college took only a subset of the classes in computer systems or an operating systems class that focused on resource management will find that this text refreshes them with a modern and broader perspective.
Exercises and Problem Sets: Each chapter of the textbook ends with a few short-answer exercises intended to test understanding of some of the concepts in that chapter. At the end of the book is a much longer collection of problem sets that challenge the reader to apply the concepts to new and different problems similar to those that might be encountered in the real world. In most cases, the problem sets require concepts from several chapters. Each problem set identifies the chapter or chapters on which it is focused, but later problem sets typically draw concepts from all earlier chapters. Answers to the exercises and solutions for the problem sets are available from the publisher in a separate book for instructors.
The exercises and problem sets can be used in several ways:
As tools for learning. In this mode, the answers and solutions are available to the student, who is encouraged to work the exercises and problem sets and come up with answers and solutions on his or her own. By comparing those answers and solutions with the expected ones, the student receives immediate feedback that can correct misconceptions and can raise questions about ambiguities or misunderstandings. One technique to encourage study of the exercises and solutions is to announce that questions identical to or based on one or more of the problem sets will appear on a forthcoming examination.
As homework or examination material. In this mode, exercises and problem sets are assigned as homework, and the student hands in answers that are evaluated and handed back together with copies of the answers and solutions.
Case Studies and Readings: To complement the text, the reader should supplement it with readings from the professional technical literature and with case studies. Following the last chapter is a selected bibliography of books and papers that offer wisdom, system design principles, and case studies surrounding the study of systems. By varying the pace of introduction and the number and intellectual depth of the readings, the text can be the basis for a one-term undergraduate core course, a two-term or three-quarter undergraduate sequence, or a graduate-level introduction to computer systems.
Projects: Our experience is that for a course that touches many aspects of computer systems, a combination of several lightweight hands-on assignments (for example, experimentally determine the size of the caches of a personal computer or trace asymmetrical routes through the Internet), plus one or two larger paper projects that involve having a small team do a high-level system design (for example, in a 10-page report design a reliable digital storage system for the Library of Congress), make an excellent adjunct to the text. On the other hand, substantial programming projects that require learning the insides of a particular system take so much homework time that when combined with a broad concepts course they create an overload. Courses with programming projects do work well in follow-on specialized electives, for example, on operating systems, networks, databases, or distributed systems. For this reason, at M.I.T. we assign programming projects in several advanced electives but not in the systems course that is based on this textbook.
Support: Several on-line resources provide support for this textbook. The first of these resources is a set of course syllabi, reading lists, problem sets, videotaped lectures, quizzes, and quiz solutions. A second resource is a Web site of the publisher that is devoted to collecting resources and links of interest to students, professional readers, and instructors. A third resource is a mostly open Web site for communication between instructors of M.I.T. course 6.033, which uses this text, and their current students. It contains announcements, readings, and problem assignments for the current or most recent teaching term. In addition to current class communications, this Web site also holds an archive going back to 1995 that includes
Examinations and solutions (These overlap the exercises and problem sets of the textbook but they also include exam questions and answers about the outside readings.)
Instructions for finding all of these on-line resources are in the section “Where to find Part II and other on-line materials”.
Because not every instructor may want to use every chapter of the textbook, it is presented in what, at least at the time of publication, may be viewed as a somewhat novel way: The first six chapters, which the authors consider to be the core materials for almost any course about computer systems, appear in this printed book. The remaining five chapters are available on-line from the authors and M.I.T. under a Creative Commons license that permits free, unlimited non-commercial use and remixing. The on-line chapters are also available on the Web site of the publisher of this textbook. There are many forward cross-references from the core chapters to the later chapters. Those cross-references are identified as in this example: “This topic is explored in more detail in Section 7.4.1 [on-line]”.
Themes: Three themes run through this textbook. First, as suggested by its title, the text emphasizes the importance of systematic design principles. As each design principle is encountered for the first time, it appears in display form with a label and a mnemonic catchphrase. When that design principle is encountered again, it is identified by its name and highlighted with a distinctive print format as a reminder of its wide applicability. The design principles are also summarized on the inside front cover of this book. A second theme is that the text is network-centered, introducing communication and networks in the beginning chapters and building on that base in the succeeding chapters. A third theme is that it is security-centered, introducing enforced modularity in early chapters and adding successively more stringent enforcement methods in succeeding chapters. The security chapter ends the book, not because it is an afterthought, but because it is the logical culmination of a development based on enforced modularity. Traditional texts and courses teach about threads and virtual memory primarily as a resource allocation problem. This text approaches those topics primarily as ways of providing and enforcing modularity, while at the same time taking advantage of multiple processors and large address spaces.
Terminology and examples: The text identifies and develops concepts and design principles that are common to several specialty fields: software engineering, programming languages, operating systems, distributed systems, networking, database systems, and machine architecture. Experienced computer professionals are likely to find that at least some parts of this text use examples, ways of thinking, and terminology that seem unusual, even foreign to their traditional ways of explaining their favorite topics. But workers from these different specialties will compile different lists of what seems foreign. The reason is that, historically, workers within these specialties have identified what turn out to be identical underlying concepts and design principles, but they have used different language, different perspectives, different examples, and different terminology to explain them.
This text chooses, for each concept, what the authors believe is the most pedagogically effective explanation and examples, adopting widely used terminology wherever possible. In cases where different specialty areas use conflicting terms, glossaries and sidebars provide bridges and discuss terminology collisions. The result is a novel, but in our experience effective, way of teaching new generations of Computer Science and Engineering students what is fundamental about computer system design. With this starting point, when the student reads an advanced book or paper or takes an advanced elective course, he or she should be able to immediately recognize familiar concepts cloaked in the terminology of the specialty. A scientist would explain this approach by saying “The physics is independent of the units of measurement.” A similar principle applies to the engineering of computer systems: “The concepts are independent of the terminology”.
Citations: The text does not use citations as a scholarly method of identifying the originators of each concept or idea; if it did, the book would be twice as thick. Instead the citations that do appear are pointers to related materials that the authors think are worth knowing about. There is one exception: certain sections are devoted to war stories, which may have been distorted by generations of retelling. These stories include citations intended to identify the known sources of each story, so that the reader has a way to assess their validity.
Relation to ACM/IEEE recommendations: The ACM/IEEE Computer Science and Engineering recommendations of 2001 and 2004 describe two layers. The first layer is a set of modules that constitute an appropriate CSE education. The second layer consists of several suggested packagings of those modules into term-sized courses. This book may be best viewed as a distinct, modern packaging of the modules, somewhat resembling the ACM/IEEE Computer Science 2001 recommendation CS226c, Operating Systems and Networking (compressed), but with the additional scope of naming, fault tolerance, atomicity, and both system and network security. It also somewhat resembles the ACM/IEEE Computer Engineering 2004 recommendation CPED203, Operating Systems and Net-Centric Computing, with the additional scope of naming, fault tolerance, atomicity, and cryptographic protocols.
Chapter 1: Systems. This chapter lays out the general philosophy of the authors on ways to think about systems, with examples illustrating how computer systems are similar to, and different from, other engineering systems. It also introduces three main ideas: (1) the importance of systematic design principles, (2) the role of modularity in controlling complexity of large systems, and (3) methods of enforcing modularity.
Chapter 2: Elements of Computer System Organization. This chapter introduces three key methods of achieving and taking advantage of modularity in computer systems: abstraction, naming, and layers. The discussion of abstraction lightly reviews computer architecture from a systems perspective, creating a platform on which the rest of the book builds, but without simple repetition of material that readers probably already know. The naming model is fundamental to how computer systems are modularized, yet it is a subject usually left to advanced texts on programming language design. The chapter ends with a case study of the way in which naming, layering, and abstraction are applied in the UNIX file system. Because the case study develops as a series of pseudocode fragments, it provides both a concrete example of the concepts of the chapter and a basis for reference in later chapters.
Chapter 3: Design of Naming Schemes. This chapter continues the discussion of naming in system design by introducing pragmatic engineering considerations and reinforcing the role that names play in organizing a system as a collection of modules. The chapter ends with a case study and a collection of war stories. The case study uses the Uniform Resource Locator (URL) of the World Wide Web to show an example of nearly every naming scheme design consideration. The war stories are examples of failures of real-world naming systems, illustrating what goes wrong when a designer ignores or is unaware of design considerations.
Chapter 4: Enforcing Modularity with Clients and Services. The first three chapters developed the importance of modularity in system design. This chapter begins the theme of enforcing that modularity by introducing the client/service model, which is a powerful and widely used method of allowing modules to interact without interfering with one another. This chapter also begins the network-centric perspective that pervades the rest of the book. At this point, we view the network only as an abstract communication system that provides a strong boundary between client and service. Two case studies again help nail down the concepts. The first is of the Internet Domain Name System (DNS), which provides a concrete illustration of the concepts of both Chapters 3 and 4. The second case study, that of the Sun Network File System (NFS), builds on the case study of the UNIX file system in Chapter 2 and illustrates the impact of remote service on the semantics of application programming interfaces.
Chapter 5: Enforcing Modularity with Virtualization. This chapter switches attention to enforcing modularity within a computer by introducing virtual memory and virtual processors, commonly called threads. For both memory and threads, the discussion begins with an environment that has unlimited resources. The virtual memory discussion starts with an assumption of many threads operating in an unlimited address space and then adds mechanisms to prevent threads from unintentionally interfering with one another’s data—addressing domains and the user/kernel mode distinction. Finally, the text examines limited address spaces, which require introducing virtual addresses and address translation, along with the inter-address-space communication problems that they create.
Similarly, the discussion of threads starts with the assumption that there are as many processors as threads, and concentrates on coordinating their concurrent activities. It then moves to the case where a limited number of real processors are available, so thread management is also required. The discussion of thread coordination uses eventcounts and sequencers, a set of mechanisms that are not often seen in practice but that fit the examples in a natural way. Traditionally, thread coordination is among the hardest concepts for the first-time reader to absorb. Problem sets then invite readers to test their understanding of the principles with semaphores and condition variables.
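For readers meeting these coordination primitives for the first time, a small example may help make the problem concrete. The sketch below is ours, not the book's pseudocode: it shows a SEND/RECEIVE bounded buffer of the kind Chapter 5 develops, coordinated here with a condition variable (one of the alternatives the problem sets explore) and written in Python rather than the text's ad hoc pseudocode.

    # A sketch (not from the book) of bounded-buffer sequence coordination.
    # The book develops eventcounts and sequencers; this version uses a
    # condition variable, one of the primitives the problem sets explore.
    import threading

    class BoundedBuffer:
        def __init__(self, size):
            self.items = []
            self.size = size
            self.changed = threading.Condition()

        def send(self, item):              # called by producer threads
            with self.changed:
                while len(self.items) >= self.size:
                    self.changed.wait()    # buffer full: wait for a RECEIVE
                self.items.append(item)
                self.changed.notify_all()  # wake any waiting receivers

        def receive(self):                 # called by consumer threads
            with self.changed:
                while not self.items:
                    self.changed.wait()    # buffer empty: wait for a SEND
                item = self.items.pop(0)
                self.changed.notify_all()  # wake any waiting senders
                return item

A producer thread calls send and blocks when the buffer is full; a consumer calls receive and blocks when it is empty. That blocking-until-the-other-side-acts behavior is exactly the sequence coordination problem that the chapter's eventcounts and sequencers address.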
The chapter explains the concepts of virtual memory and threads both in words and in pseudocode that help clarify how the abstract ideas actually work, using familiar real-world problems. In addition, the discussion of thread coordination is viewed as the first step in understanding atomicity, which is the subject of Chapter 9 [on-line].
The chapter ends with a case study and an application. The case study explores how enforced modularity has evolved over the years in the Intel x86 processor family. The application is the use of virtualization to create virtual machines. The overall perspective of this chapter is to focus on enforcing modularity rather than on resource management, taking maximum advantage of contemporary hardware technology, in which processor chips are multicore, address spaces are 64 bits wide, and the amount of directly addressable memory is measured in gigabytes.
Chapter 6: Performance. This chapter focuses on intrinsic performance bottlenecks that are found in common across many kinds of computer systems, including operating systems, databases, networks, and large applications. It explores two of the traditional topics of operating systems books—resource scheduling and multilevel memory management—but in a context that emphasizes the importance of maintaining perspective on performance optimization in a world where each decade brings a thousand-fold improvement in some underlying hardware capabilities while barely affecting other performance metrics. As an indication of this different perspective, scheduling is illustrated with a disk arm scheduling problem rather than the usual time-sharing processor scheduler.
Chapters 7 through 11 are on-line, in Part II of the book. Their contents are described in the section titled “About Part II” on page 369, and information on how to locate them can be found in “Where to find Part II and other on-line materials”.
Suggestions for Further Reading. A selected reading list includes commentary on why each selection is worth reading. The selection emphasis is on books and papers that provide insight rather than materials that provide details.
Problem Sets. The authors use examinations not just as a method of assessment, but also as a method of teaching. Therefore, some of the exercises at the end of each chapter and the problem sets at the end of the book (all of which are derived from examinations administered over the years while teaching the material of this textbook) go well beyond simple practice with the concepts. In working the problems out, the student explores alternative designs, learns about variations of techniques seen in the textbook, and becomes familiar with interesting, sometimes exotic, ideas and methods that have been proposed for or used in real system designs. The problem sets generally have significant setup, and they ask questions that require applying concepts creatively, with the goal of understanding the trade-offs that arise in using these methods.
Glossary. As mentioned earlier, the literature of computer systems derives from several different specialties that have each developed their own dictionaries of system-related concepts. This textbook adopts a uniform terminology throughout, and the Glossary offers definitions of each significant term of art, indicates which chapter introduces the term, and in many cases explains different terms used by different workers in different specialties. For completeness and for easy reference, the Glossary in this book includes terms introduced in Part II.
Index of Concepts. The index tells where to find the defining discussion of every concept. In addition, it lists every application of each of the design principles. (For completeness, it includes concepts that are introduced in Part II, listing just the chapter number.)
1. Professors Saltzer and Kaashoek and MIT OpenCourseWare* provide, free of charge, on-line versions of Chapters 7 through 11, additional problem sets, a copy of the glossary, and a comprehensive index in the form of one Portable Document Format (PDF) file per chapter or section and also a single PDF file containing the entire set. Those materials can be found at
http://ocw.mit.edu/Saltzer-Kaashoek
2. The publisher of this printed book also maintains a set of on-line resources at
www.ElsevierDirect.com/9780123749574
Click on the link “Companion Materials” where you will find Part II of the book as well as other resources, including figures from the text in several formats. Additional materials for instructors (registration required) can be found by clicking the “Manual” link.
3. Teaching and support materials can be found at
http://ocw.mit.edu/6-033
4. The Web site for the current MIT class that uses this textbook, including the archives of older teaching materials, is at
http://mit.edu/6.033 (Some copyrighted or privacy-sensitive materials on that Web site are restricted to current MIT students.)
* The M.I.T. OpenCourseWare initiative places on-line, for non-commercial free access, teaching materials from many M.I.T. courses, and thus is helping set a standard for curricula in science and engineering. In addition to Chapters 7 through 11, OpenCourseWare publishes on-line materials for the M.I.T. course that uses these materials, 6.033. Thus, an instructor interested in making use of the textbook can find in one place course syllabi, reading lists, problem sets, videotaped lectures, quizzes, and solutions.
This textbook began as a set of notes for the advanced undergraduate course Engineering of Computer Systems (6.033, originally 6.233), offered by the Department of Electrical Engineering and Computer Science of the Massachusetts Institute of Technology starting in 1968. The text has benefited from four decades of comments and suggestions by many faculty members, visitors, recitation instructors, teaching assistants, and students. Over 5,000 students have used (and suffered through) draft versions, and observations of their learning experiences (as well as frequent confusion caused by the text) have informed the writing. We are grateful for those many contributions. In addition, certain aspects deserve specific acknowledgment.
1. Naming (Section 2.2 and Chapter 3) The concept and organization of the materials on naming grew out of extensive discussions with Michael D. Schroeder. The naming model (and part of our development) follows closely the one developed by D. Austin Henderson in his Ph.D. thesis. Stephen A. Ward suggested some useful generalizations of the naming model, and Roger Needham suggested several concepts in response to an earlier version of this material. That earlier version, including in-depth examples of the naming model applied to addressing architectures and file systems, and an historical bibliography, was published as Chapter 3 in Rudolf Bayer et al., editors, Operating Systems: An Advanced Course, Lecture Notes in Computer Science 60, pages 99–208. Springer-Verlag, 1978, reprinted 1984. Additional ideas have been contributed by many others, including Ion Stoica, Karen Sollins, Daniel Jackson, Butler Lampson, David Karger, and Hari Balakrishnan.
2. Enforced Modularity and Virtualization (Chapters 4 and 5) Chapter 4 was heavily influenced by lectures on the same topic by David L. Tennenhouse. Both chapters have been improved by substantial feedback from Hari Balakrishnan, Russ Cox, Michael Ernst, Eddie Kohler, Chris Laas, Barbara H. Liskov, Nancy Lynch, Samuel Madden, Robert T. Morris, Max Poletto, Martin Rinard, Susan Ruff, Gerald Jay Sussman, Julie Sussman, and Michael Walfish.
3. Networks (Chapter 7 [on-line]) Conversations with David D. Clark and David L. Tennenhouse were instrumental in laying out the organization of this chapter, and lectures by Clark were the basis for part of the presentation. Robert H. Halstead Jr. wrote an early draft set of notes about networking, and some of his ideas have also been borrowed. Hari Balakrishnan provided many suggestions and corrections and helped sort out muddled explanations, and Julie Sussman and Susan Ruff pointed out many opportunities to improve the presentation. The material on congestion control was developed with the help of extensive discussions with Hari Balakrishnan and Robert T. Morris, and is based in part on ideas from Raj Jain.
4. Fault Tolerance (Chapter 8 [on-line]) Most of the concepts and examples in this chapter were originally articulated by Claude Shannon, Edward F. Moore, David Huffman, Edward J. McCluskey, Butler W. Lampson, Daniel P. Siewiorek, and Jim N. Gray.
5. Transactions and Consistency (Chapters 9 [on-line] and 10 [on-line]) The material of the transactions and consistency chapters has been developed over the course of four decades with aid and ideas from many sources. The concept of version histories is due to Jack Dennis, and the particular form of all-or-nothing and before-or-after atomicity with version histories developed here is due to David P. Reed. Jim N. Gray not only came up with many of the ideas described in these two chapters, he also provided extensive comments. (That doesn’t imply endorsement—he disagreed strongly about the importance of some of the ideas!) Other helpful comments and suggestions were made by Hari Balakrishnan, Andrew Herbert, Butler W. Lampson, Barbara H. Liskov, Samuel R. Madden, Larry Rudolph, Gerald Jay Sussman, and Julie Sussman.
6. Computer Security (Chapter 11 [on-line]) Sections 11.1 and 11.6 draw heavily from the paper “The protection of information in computer systems” by Jerome H. Saltzer and Michael D. Schroeder, Proceedings of the IEEE 63, 9 (September, 1975), pages 1278–1308. Ronald Rivest, David Mazières, and Robert T. Morris made significant contributions to material presented throughout the chapter. Brad Chen, Michael Ernst, Kevin Fu, Charles Leiserson, Susan Ruff, and Seth Teller made numerous suggestions for improving the text.
7. Suggested Outside Readings Ideas for suggested readings have come from many sources. Particular thanks must go to Michael D. Schroeder, who uncovered several of the classic systems papers in places outside computer science where nobody else would have thought to look; Edward D. Lazowska, who provided an extensive reading list used at the University of Washington; and Butler W. Lampson, who provided a thoughtful review of the list.
8. The Exercises and Problem Sets The exercises at the end of each chapter and the problem sets at the end of the book have been collected, suggested, tried, debugged, and revised by many different faculty members, instructors, teaching assistants, and undergraduate students over a period of 40 years in the process of constructing quizzes and examinations while teaching the material of the text. Certain of the longer exercises and most of the problem sets, which are based on lead-in stories and include several related questions, represent a substantial effort by a single individual. For those problem sets not developed by one of the authors, a credit line appears in a footnote on the first page of the problem set. Following each problem or problem set is an identifier of the form “1978–3–14”. This identifier reports the year, examination number, and problem number of the examination in which some version of that problem first appeared.
Jerome H. Saltzer
M. Frans Kaashoek
2009
WHERE TO FIND PART II AND OTHER MATERIALS
1.4. Computer Systems are the Same But Different
1.5. Coping with Complexity II
What the Rest of this Book is About
Chapter 2. Elements of Computer System Organization
2.1. The Three Fundamental Abstractions
2.2. Naming in Computer Systems
2.3. Organizing Computer Systems with Names and Layers
2.5. Case study: UNIX® file system layering and naming
Chapter 3. The Design of Naming Schemes
3.1. Considerations in the design of naming schemes
3.2. Case Study: The Uniform Resource Locator (URL)
3.3. War stories: Pathologies in the use of names
Chapter 4. Enforcing Modularity with Clients and Services
4.1. Client/Service Organization
4.2. Communication Between Client and Service
4.3. Summary and the Road Ahead
4.4. Case Study: The Internet Domain Name System (DNS)
4.5. Case Study: The Network File System (NFS)
Chapter 5. Enforcing Modularity with Virtualization
5.1. Client/Server Organization within a Computer Using Virtualization
5.2. Virtual Links Using SEND, RECEIVE, and a Bounded Buffer
5.3. Enforcing Modularity in Memory
5.5. Virtualizing Processors Using Threads
5.6. Thread Primitives for Sequence Coordination
5.7. Case Study: Evolution of Enforced Modularity in the Intel X86
5.8. Application: Enforcing Modularity Using Virtual Machines
6.1. Designing for Performance
APPENDIX A. The Binary Classification Trade-Off
1.1 Systems and Complexity
1.1.1 Common Problems of Systems in Many Fields
1.1.2 Systems, Components, Interfaces, and Environments
1.1.3 Complexity
1.2 Sources of Complexity
1.3 Coping with Complexity I
1.3.1 Modularity
1.3.2 Abstraction
1.3.3 Layering
1.3.4 Hierarchy
1.3.5 Putting it Back Together: Names Make Connections
1.4 Computer Systems are the Same but Different
1.4.1 Computer Systems have no Nearby Bounds on Composition
1.4.2 d(technology)/dt is Unprecedented
1.5 Coping with Complexity II
1.5.1 Why Modularity, Abstraction, Layering, and Hierarchy aren’t Enough
1.5.2 Iteration
1.5.3 Keep it Simple
This book is about computer systems, and this chapter introduces some of the vocabulary and concepts used in designing computer systems. It also introduces “systems perspective”, a way of thinking about systems that is global and encompassing rather than focused on particular issues. A full appreciation of this way of thinking can’t really be captured in a short summary, so this chapter is actually just a preview of ideas that will be developed in depth in succeeding chapters.
The usual course of study of computer science and engineering begins with linguistic constructs for describing computations (software) and physical constructs for realizing computations (hardware). It then branches, focusing, for example, on the theory of computation, artificial intelligence, or the design of systems, which itself is usually divided into specialities: operating systems, transaction and database systems, computer architecture, software engineering, compilers, computer networks, security, and reliability. Rather than immediately tackling one of those specialties, we assume that the reader has completed the introductory courses on software and hardware, and we begin a broad study of computer systems that supports the entire range of systems specialties.
Many interesting applications of computers require
coordination of concurrent activities
geographically separated but linked data
vast quantities of stored information
To develop applications that have these requirements, the designer must look beyond the software and hardware and view the computer system as a whole. In doing so, the designer encounters many new problems—so many that the limit on the scope of computer systems generally arises neither from laws of physics nor from theoretical impossibility, but rather from limitations of human understanding.
Some of these same problems have counterparts, or at least analogs, in other systems that have, at most, only incidental involvement of computers. The study of systems is one place where computer engineering can take advantage of knowledge from other engineering areas: civil engineering (bridges and skyscrapers), urban planning (the design of cities), mechanical engineering (automobiles and air conditioning), aviation and space flight, electrical engineering, and even ecology and political science. We start by looking at some of those common problems. Then we will examine two ways in which computer systems pose problems that are quite different. Don’t worry if some of the examples are of things you have never encountered or are only dimly aware of. The sole purpose of the examples is to illustrate the range of considerations and similarities across different kinds of systems.
Much wisdom about systems that has accumulated over the centuries is passed along in the form of folklore, maxims, aphorisms and quotations. Some of that wisdom is captured in the boxes at the bottom of these pages.
Everything should be made as simple as possible, but no simpler.
— commonly attributed to Albert Einstein; it is actually a paraphrase of a comment he made in a 1933 lecture at Oxford.
As we proceed in this chapter and throughout the book, we shall point out a series of system design principles, which are rules of thumb that usually apply to a diverse range of situations. Design principles are not immutable laws, but rather guidelines that capture wisdom and experience and that can help a designer avoid making mistakes. The astute reader will quickly realize that sometimes a tension, even to the point of contradiction, exists between different design principles. Nevertheless, if a designer finds that he or she is violating a design principle, it is a good idea to review the situation carefully.
At the first encounter of a design principle, the text displays it prominently. Here is an example, found on page 16.
Avoid Excessive Generality
If it’s good for everything, it’s good for nothing.
Each design principle thus has a formal title (“Avoid excessive generality”) and a brief informal description (“If it’s good for …”), which are intended to help recall the principle. Most design principles will show up several times, in different contexts, which is one reason why they are useful. The text highlights later encounters of a principle such as: avoid excessive generality. A list of all of the design principles in the book can be found on the inside front cover and also in the index, under “Design principles”.
The remaining sections of this chapter discuss common problems of systems, the sources of those problems, and techniques for coping with them.
The problems one encounters in these many kinds of systems can usefully be divided into four categories: emergent properties, propagation of effects, incommensurate scaling, and trade-offs.
Seek simplicity and distrust it.
— Alfred North Whitehead, The Concept of Nature (1920)
Emergent properties are properties that are not evident in the individual components of a system, but show up when combining those components, so they might also be called surprises. Emergent properties abound in most systems, although there can always be a (fruitless) argument about whether or not careful enough prior analysis of the components might have allowed prediction of the surprise. It is wise to avoid this argument and instead focus on an unalterable fact of life: some things turn up only when a system is built.
Some examples of emergent properties are well known. The behavior of a committee or a jury often surprises outside observers. The group develops a way of thinking that could not have been predicted from knowledge about the individuals. (The concept of—and the label for—emergent properties originated in sociology.) When the Millennium Bridge for pedestrians over the River Thames in London opened, its designers had to close it after only a few days. They were surprised to discover that pedestrians synchronize their footsteps when the bridge sways, causing it to sway even more. Interconnection of several electric power companies to allow load sharing helps reduce the frequency of power failures, but when a failure finally occurs it may take down the entire interconnected structure. The political surprise is that the number of customers affected may be large enough to attract the unwanted attention of government regulators.
The electric power inter-tie also illustrates the second category of system problems—propagation of effects—when a tree falling on a power line in Oregon leads to the lights going out in New Mexico, 1000 miles away. What looks at first to be a small disruption or a local change can have effects that reach from one end of a system to the other. An important requirement in most system designs is to limit the impact of failures. As another example of propagation of effects, consider an automobile designer’s decision to change the tire size on a production model car from 13 to 15 inches. The reason for making the change might have been to improve the ride. On further analysis, this change leads to many other changes: redesigning the wheel wells, enlarging the spare tire space, rearranging the trunk that holds the spare tire, and moving the back seat forward slightly to accommodate the trunk redesign. The seat change makes knee room in the back seat too small, so the backs of the seats must be made thinner, which in turn reduces the comfort that was the original reason for changing the tire size, and it may also reduce safety in a collision. The extra weight of the trunk and rear seat design means that stiffer rear springs are now needed. The rear axle ratio must be modified to keep the force delivered to the road by the wheels correct, and the speedometer gearing must be changed to agree with the new tire size and axle ratio.
Those effects are the obvious ones. In complicated systems, as the analysis continues, more distant and subtle effects normally appear. As a typical example, the automobile manufacturer may find that the statewide purchasing office for Texas does not currently have a certified supplier for replacement tires of the larger size. Thus there will probably be no sales of cars to the Texas government for two years, which is the length of time it takes to add a supplier onto the certified list. Folk wisdom characterizes propagation of effects as: “There are no small changes in a large system”.
Our life is frittered away by detail … simplicity, simplicity, simplicity!
— Henry David Thoreau, Walden; or, Life in the Woods (1854)
The third characteristic problem encountered in the study of systems is incommensurate scaling: as a system increases in size or speed, not all parts of it follow the same scaling rules, so things stop working. The mathematical description of this problem is that different parts of the system exhibit different orders of growth. Some examples:
Galileo observed that “nature cannot produce a … giant ten times taller than an ordinary man unless by … greatly altering the proportions of his limbs and especially of his bones, which would have to be considerably enlarged over the ordinary” [Discourses and Mathematical Demonstrations on Two New Sciences, second day, Leiden, 1638]. In a classic 1928 paper, “On being the right size” [see Suggestions for Further Reading 1.4.1], J. B. S. Haldane uses the example of a mouse, which, if scaled up to the size of an elephant, would collapse of its own weight. For both examples, the reason is that weight grows with volume, which is proportional to the cube of linear size, but bone strength, which depends primarily on cross-sectional area, grows only with the square of linear size. Thus a real elephant requires a skeletal arrangement that is quite different from that of a scaled-up mouse.
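The arithmetic behind both of these examples follows directly from the proportionalities just stated, writing L for linear size:

\[
\text{weight} \propto L^3, \qquad \text{bone strength} \propto L^2, \qquad \frac{\text{weight}}{\text{bone strength}} \propto L
\]

The load on each unit of bone cross-section thus grows linearly with L, so any fixed set of proportions must fail beyond some size.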
The Egyptian architect Sneferu tried to build larger and larger pyramids. Unfortunately, the facing fell off the pyramid at Meidum, and the ceiling of the burial chamber of the pyramid at Dashur cracked. He later figured out that he could scale a pyramid up to the size of the pyramids at Giza by lowering the ratio of the pyramid’s height to its width. The reason this solution worked has apparently never been completely analyzed, but it seems likely that incommensurate scaling was involved—the weight of a pyramid increases with the cube of its linear size, while the strength of the rock used to create the ceiling of a burial chamber increases only with the area of its cross-section, which grows with the square.
The captain of a modern oil supertanker finds that the ship is so massive that when underway at full speed it takes 12 miles to bring it to a straight-line stop—but 12 miles is beyond the horizon as viewed from the ship’s bridge (see Sidebar 1.1 for the details).
The height of a skyscraper is limited by the area of lower floors that must be devoted to providing access to the floors above. The amount of access area required (for example, for elevators and stairs) is proportional to the number of people who have offices on higher floors. That number is in turn proportional to the number of higher floors multiplied by the usable area of each floor. If all floors have the same area, and the number of floors increases, at some point the bottom floor will be completely used up providing access to higher floors, so the bottom floor provides no added value (apart from being able to brag about the building’s height). In practice, the economics of office real estate dictate that no more than 25% of the lowest floor be devoted to access.
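A back-of-the-envelope model makes the skyscraper limit concrete. Suppose, purely as an illustrative assumption, that each occupied floor consumes access area on the bottom floor equal to a fraction c of one floor’s area A. With n floors, the access area is

\[
A_{\text{access}} = c \cdot n \cdot A,
\]

and the 25% rule quoted above imposes \(c \cdot n \cdot A \le 0.25\,A\), that is, \(n \le 0.25/c\): a hard ceiling on height for any given value of c.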
By undue profundity we perplex and enfeeble thought.
— Edgar Allan Poe, “The Murders in the Rue Morgue” (1841)
Sidebar 1.1 Stopping a Supertanker
A little geometry reveals that the distance to the visual horizon is proportional to the square root of the height of the bridge. That height (presumably) grows with the first power of the supertanker’s linear dimension. The energy required to stop or turn a supertanker is proportional to its mass, which grows with the third power of its linear dimensions. The time required to deliver the stopping or turning energy is less clear, but pushing on the rudder and reversing the propellers are the only tools available, and both of those have surface area that grows with the square of the linear dimension.
Here is the bottom line: if we double the tanker’s linear dimensions, the momentum goes up by a factor of 8, and the ability to deliver stopping or turning energy goes up by only a factor of 4, so we need to see twice as far ahead. Unfortunately, the horizon will be only 1.414 times as far away. Inevitably, there is some size for which visual navigation must fail.
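Those factors follow directly from the growth rates given in the preceding paragraph. Doubling the linear dimension L:

\[
\text{momentum} \propto L^3 \Rightarrow \times 8, \qquad \text{stopping or turning force} \propto L^2 \Rightarrow \times 4,
\]
\[
\text{required stopping distance} \propto \frac{L^3}{L^2} = L \Rightarrow \times 2, \qquad \text{horizon} \propto \sqrt{L} \Rightarrow \times \sqrt{2} \approx 1.414.
\]

Required sight distance grows faster than the horizon recedes, so the two curves must eventually cross.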
Incommensurate scaling shows up in most systems. It is usually the factor that limits the size or speed range that a single system design can handle. On the other hand, one must be cautious with scaling arguments. They were used at the beginning of the twentieth century to support the claim that it was a waste of time to build airplanes (see Sidebar 1.2).
Sidebar 1.2 Why Airplanes can’t Fly
The weight of an airplane grows with the third power of its linear dimension, but the lift, which is proportional to surface area, can grow only with the second power. Even if a small plane can be built, a larger one will never get off the ground.
This line of reasoning was used around 1900 by both physicists and engineers to argue that it was a waste of time to build heavier-than-air machines. Alexander Graham Bell proved that this argument wasn’t the whole story by flying box kites in Maine in the summer of 1902. In his experiments he attached two box kites side by side, a configuration that doubled the lifting surface area, but also allowed removal of the redundant material and supports where the two kites touched. Thus, the lift-to-weight ratio actually improved as the scale increased. Bell published his results in “The tetrahedral principle in kite structure” [see Suggestions for Further Reading 1.4.2].
The fourth problem of system design is that many constraints present themselves as trade-offs. The general model of a trade-off begins with the observation that there is a limited amount of some form of goodness in the universe, and the design challenge is first to maximize that goodness, second to avoid wasting it, and third to allocate it to the places where it will help the most. One common form of trade-off is sometimes called the waterbed effect: pushing down on a problem at one point causes another problem to pop up somewhere else. For example, one can typically push a hardware circuit to run at a higher clock rate, but that change increases both power consumption and the risk of timing errors. It may be possible to reduce the risk of timing errors by making the circuit physically smaller, but then less area will be available to dissipate the heat caused by the increased power consumption. Another common form of trade-off appears in binary classification, which arises, for example, in the design of smoke detectors, spam (unwanted commercial e-mail message) filters, database queries, and authentication devices. The general model of binary classification is that we wish to classify a set of things into two categories based on the presence or absence of some property, but we lack a direct measure of that property. We therefore instead identify and use some indirect measure, known as a proxy. Occasionally, this scheme misclassifies something. By adjusting parameters of the proxy, the designer may be able to reduce one class of mistakes (in the case of a smoke detector, unnoticed fires; for a spam filter, legitimate messages marked as spam), but only at the cost of increasing some other class of mistakes (for the smoke detector, false alarms; for the spam filter, spam marked as legitimate messages). Appendix A explores the binary classification trade-off in more detail. Much of a system designer’s intellectual effort goes into evaluating various kinds of trade-offs.
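Here is a minimal sketch of that proxy trade-off in Python. The smoke-detector scenario, the score distributions, and the thresholds are all illustrative assumptions, not data from the text; the point is only that moving the threshold trades one class of mistakes for the other:

```python
# Binary classification with a proxy: a "smoke level" score stands in for
# the property we cannot measure directly (is there a fire?).
import random

random.seed(1)

# Illustrative score distributions; fires tend to score higher, but the
# two distributions overlap, so no threshold classifies perfectly.
fires     = [random.gauss(0.7, 0.15) for _ in range(1000)]
non_fires = [random.gauss(0.4, 0.15) for _ in range(1000)]

def mistakes(threshold):
    missed = sum(1 for s in fires if s < threshold)              # unnoticed fires
    false_alarms = sum(1 for s in non_fires if s >= threshold)   # false alarms
    return missed, false_alarms

# The waterbed effect: raising the threshold suppresses false alarms
# but pops up missed fires, and vice versa.
for t in (0.3, 0.5, 0.7):
    missed, false_alarms = mistakes(t)
    print(f"threshold {t:.1f}: missed fires {missed:4d}, false alarms {false_alarms:4d}")
```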
KISS: Keep It Simple, Stupid.
— traditional management folklore; source lost in the mists of time
Emergent properties, propagation of effects, incommensurate scaling, and trade-offs are issues that the designer must deal with in every system. The question is how to build useful computer systems in the face of such problems. Ideally, we would like to describe a constructive theory, one that allows the designer systematically to synthesize a system from its specifications and to make necessary trade-offs with precision, just as there are constructive theories in such fields as communications systems, linear control systems, and (to a certain extent) the design of bridges and skyscrapers. Unfortunately, in the case of computer systems, we find that we were apparently born too soon. Although our early arrival on the scene offers the challenge to develop the missing theory, the problem is quickly apparent—we work almost entirely by analyzing ad hoc examples rather than by synthesizing.
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
— Alan J. Perlis, “Epigrams on Programming” (1982)
So, in place of a well-organized theory, we use case studies. For each subtopic in this book, we shall begin by identifying requirements with the apparent intent of deriving the system structure from the requirements. Then, almost immediately we switch to case studies and work backwards to see how real, in-the-field systems meet the requirements that we have set. Along the way we point out where systematic approaches to synthesizing a system from its requirements are beginning to emerge, and we introduce representations, abstractions, and design principles that have proven useful in describing and building systems. The intended result of this study is insight into how designers create real systems.
Webster’s Third New International Dictionary, Unabridged, defines a system as “a complex unity formed of many often diverse parts subject to a common plan or serving a common purpose.” Although this definition will do for casual use of the word, engineers usually prefer something a bit more concrete. We identify the “many often diverse parts” by naming them components. We identify the “unity” and “common plan” with the interconnections of the components, and we perceive the “common purpose” of a system to be to exhibit a certain behavior across its interface to an environment. Thus we formulate our technical definition: A system is a set of interconnected components that has an expected behavior observed at the interface with its environment.
The underlying idea of the concept of system is to divide all the things in the world into two groups: those under discussion and those not under discussion. Those things under discussion are part of the system—those that are not are part of the environment. For example, we might define the solar system as consisting of the sun, planets, asteroids, and comets. The environment of the solar system is the rest of the universe. (Indeed, the word “universe” is a synonym for environment.)
There are always interactions between a system and its environment; these interactions are the interface between the system and the environment. The interface between the solar system and the rest of the universe includes gravitational attraction for the nearest stars and the exchange of electromagnetic radiation. The primary interfaces of a personal computer typically include things such as a display, keyboard, speaker, network connection, and power cord, but there are also less obvious interfaces such as the atmospheric pressure, ambient temperature and humidity, and the electromagnetic noise environment.
One studies a system to predict its overall behavior, based on information about its components, their interconnections, and their individual behaviors. Identifying the components, however, depends on one’s point of view, which has two aspects, purpose and granularity. One may, with different purposes in mind, look at a system quite differently. One may also choose any of several different granularities. These choices affect one’s identification of the components of the system in important ways.
And simplicity is the unavoidable price we must pay for reliability.
— Charles Anthony Richard Hoare, “Data Reliability” (1975)
To see how point of view can depend on purpose, consider two points of view of a jet aircraft as a system. The first looks at the aircraft as a flying object, in which the components of the system include the body, wings, control surfaces, and engines. The environment is the atmosphere and the earth, with interfaces consisting of gravity, engine thrust, and air drag. A second point of view looks at the aircraft as a passenger-handling system. Now, the components include seats, flight attendants, the air conditioning system, and the galley. The environment is the set of passengers, and the interfaces are the softness of the seats, the meals, and the air flowing from the air conditioning system.
In the first point of view, the aircraft as a flying object, the seats, flight attendants, and galley were present, but the designer considers them primarily as contributors of weight. Conversely, in the second point of view, as a passenger-handling system, the designer considers the engine as a source of noise and perhaps also exhaust fumes, and probably ignores the control surfaces on the wings. Thus, depending on point of view, we may choose to ignore or consolidate certain system components or interfaces.
The ability to choose granularity means that a component in one context may be an entire system in another. From an aircraft designer’s point of view, a jet engine is a component that contributes weight, thrust, and perhaps drag. On the other hand, the manufacturer of the engine views it as a system in its own right, with many components—turbines, hydraulic pumps, bearings, afterburners—all of which interact in diverse ways to produce thrust, which is one interface with the environment of the engine. The airplane wing that supports the engine is a component of the aircraft system, but it is part of the environment of the engine system.
When a system in one context is a component in another, it is usually called a subsystem (but see Sidebar 1.3). The composition of systems from subsystems or decomposition of systems into subsystems can be carried on to as many levels as is useful.
Sidebar 1.3 Terminology: Words Used to Describe System Composition
Since systems can contain component subsystems that are themselves systems from a different point of view, decomposition of systems is recursive. To avoid recursion in their writing, authors and designers have come up with a long list of synonyms, all trying to capture this same concept: systems, subsystems, components, elements, constituents, objects, modules, submodules, assemblies, subassemblies, and so on.
In summary, then, to analyze a system one must establish a point of view to determine which things to consider as components, what the granularity of those components should be, where the boundary of the system lies, and which interfaces between the system and its environment are of interest.
Pluralitas non est ponenda sine necessitate. (Plurality should not be assumed without necessity.)
— William of Ockham (14th century. Popularly known as “Occam’s razor,” though the idea itself is said to appear in writings of greater antiquity.)
As we use the term, a computer system or an information system is a system intended to store, process, or communicate information under automatic control. Further, we are interested in systems that are predominantly digital. Here are some examples:
the onboard engine controller of an automobile
an airline ticket reservation system
At the same time we will sometimes find it useful to look at examples of nondigital and nonautomated information handling systems, such as the post office or library, for ideas and guidance.
Webster’s definition of “system” used the word “complex”. Looking up that term, we find that complex means “difficult to understand”. Lack of systematic understanding is the underlying feature of complexity. It follows that complexity is both a subjective and a relative concept. That is, one can argue that one system is more complex than another, but even though one can count up various things that seem to contribute to complexity, there is no unified measure. Even the argument that one system is more complex than another can be difficult to make compelling—again because of the lack of a unified measure. In place of such a measure, we can borrow a technique from medicine: describe a set of signs of complexity that can help confirm a diagnosis. As a corollary, we abandon hope of producing a definitive description of complexity. We must instead look for its signs, and if enough appear, argue that complexity is present. To that end, here are five signs of complexity:
1. Large number of components. Sheer size certainly affects our view of whether or not a system rates the description “complex”.
2. Large number of interconnections. Even a few components may be interconnected in an unmanageably large number of ways (a count is sketched just after this list). For example, the Sun and the known planets comprise only a few components, but every one has gravitational attraction for every other, which leads to a set of equations that are unsolvable (in closed form) with present mathematical techniques. Worse, a small disturbance can, after a while, lead to dramatically different orbits. Because of this sensitivity to disturbance, the solar system is technically chaotic. Although there is no formal definition of chaos for computer systems, that term is often informally applied.
3. Many irregularities. By themselves, a large number of components and interconnections may still represent a simple system, if the components are repetitive and the interconnections are regular. However, a lack of regularity, as shown by the number of exceptions or by non-repetitive interconnection arrangements, strongly suggests complexity. Put another way, exceptions complicate understanding.
4. A long description. Looking at the best available description of the system, one finds that it consists of a long laundry list of properties rather than a short, systematic specification that explains every aspect. Theoreticians formalize this idea by measuring what they call the “Kolmogorov complexity” of a computational object as the length of its shortest specification (an illustration follows this list). To a certain extent, this sign may be merely a reflection of the previous three, although it emphasizes an important aspect of complexity: it is relative to understanding. On the other hand, lack of a methodical description may also indicate that the system is constructed of ill-fitting components, is poorly organized, or may have unpredictable behavior, any of which add complexity to both design and use.
5. A team of designers, implementers, or maintainers. Several people are required to understand, construct, or maintain the system. A fundamental issue in any system is whether or not it is simple enough for a single person to understand all of it. If not, it is a complex system because its description, construction, or maintenance will require not just technical expertise but also coordination and communication across a team.
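Returning to sign 2 with a worked count: among n components, the number of potential pairwise interconnections alone is

\[
\binom{n}{2} = \frac{n(n-1)}{2},
\]

so the Sun plus eight planets, only nine components, already interact through 36 distinct gravitational pairs, before counting any subtler couplings.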
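And for sign 4, compressed size offers an informal, computable stand-in for the length of the shortest specification (Kolmogorov complexity itself is uncomputable, so the following Python sketch is only an illustration):

```python
# Compressed size as a crude proxy for "length of the shortest specification".
import os
import zlib

regular = ("ab" * 500).encode()  # 1000 bytes, fully described by: repeat "ab" 500 times
irregular = os.urandom(1000)     # 1000 random bytes: no shorter description is known

print(len(zlib.compress(regular)))    # a handful of bytes: short spec, little complexity
print(len(zlib.compress(irregular)))  # roughly 1000 bytes: the description is the object
```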
Il semble que la perfection soit atteinte non quand il n’y a plus rien à ajouter, mais quand il n’y a plus rien à retrancher. (It is as if perfection be attained not when there is nothing more to add, but when there is nothing more to take away.)
— Antoine de Saint-Exupéry, Terre des Hommes (1939)
Again, an example can illustrate: contrast a small-town library with a large university library. There is obviously a difference in scale: the university has more books, so the first sign is present. The second sign is more subtle: where the small library may have a catalog to guide the user, the university library may have not only a catalog, but also finding aids, readers’ guides, abstracting services, journal indexes, and so on. Although these elaborations make the large library more useful (at least to the experienced user), they also complicate the task of adding a new item to the library: someone must add many interconnections (in this case, cross-references) so that the new item can be found in all the intended ways. The third sign, a large number of exceptions, is also apparent. Where the small library has only a few classifications (fiction, biography, nonfiction, and magazines) and a few exceptions (oversized books are kept over the newspaper rack), the university library is plagued with exceptions. Some books are oversized, others come on microfilm or on digital media, some books are rare or valuable and must be protected, the books that explain how to build a hydrogen bomb can be loaned only to certain patrons, some defy cataloging in any standard classification system. As for the fourth sign, any user of a large university library will confirm that there are no methodical rules for locating a piece of information and that library usage is an art, not a science.
’Tis the gift to be simple, ’tis the gift to be free,
’Tis the gift to come down where we ought to be;
And when we find ourselves in the place just right,
’Twill be in the valley of love and delight.
When true simplicity is gained
To bow and to bend we shan’t be ashamed;
To turn, turn will be our delight,
Till by turning, turning we come round right.
— Simple Gifts, traditional Shaker hymn
Finally, the fifth sign of complexity, a staff of more than one person, is evident in the university library. Where many small towns do in fact have just one librarian—typically an energetic person who knows each book because at one time or another he or she has had occasion to touch it—the university library has not only many personnel, but even specialists who are familiar with only one facet of library operations, such as the microform collection.
The university library exhibits all five signs of complexity, but unanimity is not essential. On the other hand, the presence of only one or two of the signs may not make a compelling case for complexity. Systems considered in thermodynamics contain an unthinkably large number of components (elementary particles) and interactions, yet from the right point of view they do not qualify as complex because there is a simple, methodical description of their behavior. It is exactly when we lack such a simple, methodical description that we have complexity.
One objection to conceiving complexity as being based on the five signs is that all systems are indefinitely, perhaps infinitely, complex because the deeper one digs the more signs of complexity turn up. Thus, even the simplest digital computer is made of gates, which are made with transistors, which are made of silicon, which is composed of protons, neutrons, and electrons, which are composed of quarks, which some physicists suggest are describable as vibrating strings, and so on. We shall address this objection in a moment by limiting the depth of digging, a technique known as abstraction. The complexity that we are interested in and worried about is the complexity that remains despite the use of abstraction.
Whatever man builds … all of man’s … efforts … invariably culminate in … a thing whose sole and guiding principle is … simplicity … perfection of invention touches hands with absence of invention, as if … [there] were a line that had not been invented but … [was] in the beginning … hidden by nature and in the end … found by the engineer.
— Antoine de Saint-Exupéry, Terre des Hommes (1939)
There are many sources of complexity, but two merit special mention. The first is in the number of requirements that the designer expects a system to meet. The second is one particular requirement: maintaining high utilization.
A primary source of complexity is just the list of requirements for a system. Each requirement, viewed by itself, may seem straightforward. Any particular requirement may even appear to add only easily tolerable complexity to an existing list of requirements. The problem is that the accumulation of many requirements adds not only their individual complexities but also complexities from their interactions. This interaction complexity arises from pressure for generality and exceptions that add complications, and it is made worse by change in individual requirements over time.
Most users of a personal computer have by now encountered some version of the following scenario: The vendor announces a new release of the program you use to manage your checkbook, and the new release has some feature that seems important or useful (e.g., it handles the latest on-line banking systems), so you order the program. Upon trying to install it, you discover that this new release requires a newer version of some shared library package. You track down that newer version and install it, only to find that the library package requires a newer version of the operating system, which you had not previously had any reason to install. Biting the bullet, you install the latest release of the operating system, and now the checkbook program works, but your add-on hard disk begins to act flaky. On investigation it turns out that the disk vendor’s proprietary software is incompatible with the new operating system release. Unfortunately, the disk vendor is still debugging an update for the disk software, and the best thing available is a beta test version that will expire at the end of the month.
The underlying cause of this scenario is that the personal computer has been designed to meet many requirements: a well-organized file system, expandability of storage, ability to attach a variety of I/O devices, connection to a network, protection from malevolent persons elsewhere in the network, usability, reliability, low cost—the list goes on and on. Each of these requirements adds complexity of its own, and the interactions among them add still more complexity.
Similarly, the telephone system has, over the years, acquired a large number of line customizing features—call waiting, call return, call forwarding, originating and terminating call blocking, reverse billing, caller ID, caller ID blocking, anonymous call rejection, do not disturb, vacation protection—again, the list goes on and on. These features interact in so many ways that there is a whole field of study of “feature interaction” in telephone systems. The study begins with debates over what should happen. For example, so-called 900 numbers have the feature called reverse billing—the called party can place a charge on the caller’s bill. Alice (Alice is the first character we have encountered in our cast of characters, described in Sidebar 1.4) has a feature that blocks outgoing calls to reverse billing numbers. Alice calls Bob, whose phone is forwarded to a 900 number. Should the call go through, and if so, which party should pay for it, Bob or Alice? There are three interacting features, and at least four different possibilities: block the call, allow the call and charge it to Bob, ring Bob’s phone, or add yet another feature that (for a monthly fee) lets Bob choose the outcome.
Sidebar 1.4 The Cast of Characters and Organizations
In concrete examples throughout this book, the reader will encounter a standard cast of characters named Alice, Bob, Charles, Dawn, Ella, and Felipe. Alice is usually the sender of a message, and Bob is its recipient. Charles is sometimes a mutual acquaintance of Alice and Bob. The others play various supporting roles, depending on the example. When we come to security, an adversarial character named Lucifer will appear. Lucifer’s role is to crack the security measures and perhaps interfere with the presumably useful work of the other characters.
The book also introduces a few fictional organizations. There are two universities: Pedantic University, on the Internet at Pedantic.edu, and The Institute of Scholarly Studies, at Scholarly.edu. There are also four mythical commercial organizations on the Internet at TrustUs.com, ShopWithUs.com, Awesome.net, and Awful.net.
M.I.T. Professor Ronald Rivest introduced Alice and Bob to the literature of computer science in Suggestions for Further Reading 11.5.1. Any other resemblance to persons living or dead or organizations real or imaginary is purely coincidental.
When in doubt, make it stout, and of things you know about.
When in doubt, leave it out.
— folklore sayings from the automobile industry
The examples suggest that there is an underlying principle at work. We call it the
Principle of Escalating Complexity
Adding a requirement increases complexity out of proportion.
The principle is subjective because complexity itself is subjective—its magnitude is in the mind of the beholder. Figure 1.1 provides a graphical interpretation of the principle. Perhaps the most important thing to recognize in studying this figure is that the complexity barrier is soft: as you add features and requirements, you don’t hit a solid roadblock to warn you to stop adding. It just gets worse.
Figure 1.1 The principle of escalating complexity.
Perfection must be reached by degrees; she requires the slow hand of time.
— attributed to François-Marie Arouet (Voltaire)
As the number of requirements grows, so can the number of exceptions and thus the complications. It is the incredible number of special cases in the United States tax code that makes filling out an income tax return a complex job. The impact of any one exception may be minor, but the cumulative impact of many interacting exceptions can make a system so complex that no one can understand it. Complications also can arise from outside requirements such as insistence that a certain component must come from a particular supplier. That component may be less durable, heavier, or not as available as one from another supplier. Those properties may not prevent its use, but they add complexity to other parts of the system that have to be designed to compensate.
Meeting many requirements with a single design is sometimes expressed as a need for generality. Generality may be loosely defined as “applying to a variety of circumstances.” Unfortunately, generality contributes to complexity, so it comes with a trade-off, and the designer must use good judgment to decide how much of the generality is actually wanted. As an extreme example, an automobile with four independent steering wheels, each controlling one tire, offers some kind of ultimate in generality, almost all of which is unwanted. Here, both the aspect of unwantedness and the resulting complexity of guidance of the auto are obvious enough, but in many cases both of these aspects are more difficult to assess: How much does a proposed form of generality complicate the system, and to what extent is that generality really useful? Unwanted generality also contributes to complexity indirectly: users of a system with excessive generality will adopt styles of usage that simplify and suppress generality that they do not need. Different users may adopt different styles and then discover that they cannot easily exchange ideas with one another. Anyone who tries to use a personal computer customized by someone else will notice this problem.
Periodically, someone tries to design a vehicle that one can drive on the highway, fly, and use as a boat, but the result of such a general design does not seem to work well in any of the intended modes of transport. To help counter excessive generality, experience suggests another design principle:*
The best is the enemy of the good.
— François-Marie Arouet (Voltaire), Dictionnaire Philosophique (1764)
Avoid Excessive Generality
If it is good for everything, it is good for nothing.
There is a tension between exceptions and generality. Part of the art of designing a subsystem is to make its features general enough to minimize the number of exceptions that must be handled as special cases. This area is one where the judgment of the system designer is most evident.
Counteracting the effects of incommensurate scaling can be an additional source of complexity. Haldane, in his essay “On being the right size”, points out that small organisms such as insects absorb enough oxygen to survive through their skins, but larger organisms, which require an amount of oxygen proportional to the cube of their linear size, don’t have enough surface area. To compensate for this incommensurate scaling, they add complexity in the form of lungs and blood vessels to absorb and deliver oxygen throughout their bodies. In the case of computers, the programmer of a 4-bit microprocessor to control a toaster can in a few days successfully write the needed code entirely with binary numbers, while the programmer of a video game with a 64-bit processor and 40 gigabytes of supporting data requires an extensive array of tools—compilers, image or video editors, special effects generators, and the like, as well as an operating system, to be able to get the job done within a lifetime. Incommensurate scaling has required employment of a far more complex set of tools.
Finally, a major source of complexity is that requirements change. System designs that are successful usually remain in use for a long time, during which the environment of the system changes. Improvements in hardware technology may lead the system maintainers to want to upgrade to faster, cheaper, or more reliable equipment. Meanwhile, knowledge of how to maintain the older equipment (and the supply of spare parts) may be disappearing. As users accumulate experience with the system, it becomes clearer that some additional requirements should have been part of the design and that some of the original requirements were less important than originally thought. Often a system will expand in scale, sometimes far beyond the vision of its original designers.
In each of these cases, the ground rules and assumptions that the original designers used to develop the system begin to lose their relevance. The system designers may have foreseen some environmental changes, but there were other changes they probably did not anticipate. As changes to meet unforeseen requirements occur, they usually add complexity. Because it can be difficult to change the architecture of a deployed system (Section 1.3 explains why), there is a powerful incentive to make changes within the existing architecture, whether or not that is the best thing to do. Propagation of effects can amplify the problems caused by change because more distant effects of a change may not be noticed until someone invokes some rarely used feature. When those distant effects finally do surface, the maintainer may again find it easiest to deal with them locally, perhaps by adding exceptions. Incommensurate scaling effects begin to dominate behavior when a later maintainer scales a system up in size or replaces the underpinnings with faster hardware. Again, the first response to these effects is usually to make local changes (sometimes called patches) to counteract them rather than to make fundamental changes in design that would require changing several modules or changing interfaces between modules.
A complex system that works is invariably found to have evolved from a simple system that works.
— John Gall, Systemantics (1975)
A closely related problem is that as systems grow in complexity with the passage of time, even the simplest change, such as to repair a bug, has an increasing risk of introducing another bug because complexity tends to obscure the full impact of the repair. A common phenomenon in older systems is that the number of bugs introduced by a bug fix release may exceed the number of bugs fixed by that release.*
The bottom line is that as systems age, they tend to accumulate changes that make them more complex. The lifetime of a system is usually limited by the complexity that accumulates as it evolves farther and farther from its original design.
One requirement by itself is frequently a specific source of complexity. It starts with a desire for high performance or high efficiency. Whenever a scarce resource is involved, an effort arises to keep its utilization high.
Consider, for example, a single-track railroad line running through a long, narrow canyon.† To improve the utilization of the single track, and push more traffic through, one might allow trains to run both ways at the same time by installing a switch and a short side track in a wide spot about halfway through the canyon. Then, if one is careful in scheduling, trains going in opposite directions will meet at the side track, where they can pass each other, effectively doubling the number of trains that the track can carry each day. However, the train operations are now much more complex than they used to be. If either train is delayed, the schedules of both are disrupted. A signaling system needs to be installed because human schedulers or operators may make mistakes. And—an emergent property—the trains now have a limit on their length. If two trains are to pass in the middle, at least one of them must be short enough to pull completely onto the side track.
Een schip op’t droogh gezeylt, dat is een seeker baken. (A ship, sailed on to dry land, that is a certain beacon. Learn from the mistakes of others.)
— Jacob Cats, Mirror on Old and New Times (1632), based on a Dutch proverb
The train in the canyon is a good illustration of how efforts to increase utilization can increase complexity. When striving for higher utilization, one usually encounters a general design principle that economists call
The Law of Diminishing Returns
The more one improves some measure of goodness, the more effort the next improvement will require.
This phenomenon is particularly noticeable in attempts to use resources more efficiently: the more completely one tries to use a scarce resource, the greater the complexity of the strategies for use, allocation, and distribution. Thus a rarely used street intersection requires no traffic control beyond a rule that the car on the right has the right-of-way. As usage increases, one must apply progressively more complex measures: stop signs, then traffic lights, then marked turning lanes with multiphase lights, then vehicle sensors to control the lights. As traffic in and out of an airport nears the airport’s capacity, measures such as stacking planes, holding them on the ground at distant airports, or coordinated scheduling among several airlines must be taken. As a general rule, the more one tries to increase utilization of a limited resource, the greater the complexity (see Figure 1.2).
Figure 1.2 An example of diminishing returns: complexity grows with increasing utilization.
The perceptive reader will notice that Figures 1.1 and 1.2 are identical. It would be useful to memorize this figure because some version of it can be used to describe many different things about systems.
It is impossible to foresee the consequences of being clever.
— Christopher Strachey, as reported by Roger Needham
As one might expect, with many fields contributing examples of systems with common problems and sources of complexity, some common techniques for coping with complexity have emerged. These techniques can be loosely divided into four general categories: modularity, abstraction, layering, and hierarchy. The following sections sketch the general method of each of the techniques. In later chapters many examples of each technique will emerge. It is only by studying those examples that their value will become clear.
The simplest, most important tool for reducing complexity is the divide-and-conquer technique: analyze or design the system as a collection of interacting subsystems, called modules. The power of this technique lies primarily in being able to consider interactions among the components within a module without simultaneously thinking about the components that are inside other modules.
To see the impact of reducing interactions, consider the debugging of a large program with, say, N statements. Assume that the number of bugs in the program is proportional to its size and the bugs are randomly distributed throughout the code. The programmer compiles the program, runs it, notices a bug, finds and fixes the bug, and recompiles before looking for the next bug. Assume also that the time it takes to find a bug in a program is roughly proportional to the size of the program. We can then model the time spent debugging:

    time to debug ∝ (number of bugs) × (time to find one bug) ∝ N × N = N²

Unfortunately, the debugging time grows proportional to the square of the program size.
Now suppose that the programmer divides the program into K modules, each of roughly equal size, so that each module contains N/K statements. To the extent that the modules implement independent features, one hopes that discovery of a bug usually will require examining only one module. The time required to debug any one module is thus reduced in two ways: the smaller module can be debugged faster, and since there are fewer bugs in smaller programs, any one module will not need to be debugged as many times. These two effects are partially offset by the need to debug all K modules. Thus our model of the time required to debug the system of K modules becomes

    time to debug ∝ K × (N/K bugs per module) × (N/K time to find one bug) = N²/K
Plan to throw one away; you will, anyhow.
— Frederick P. Brooks, The Mythical Man Month (1974)
Modularization into K components thus reduces debugging time by a factor of K. Although the detailed mechanism by which modularity reduces effort differs from system to system, this property of modularity is universal. For this reason, one finds modularity in every large system.
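The effect of this model is easy to see numerically. The following Python sketch simply evaluates the two formulas above; the statement counts are invented for illustration, and only the proportionalities matter.

    # A sketch of the debugging-time model above. The constants are
    # illustrative; the point is the N² versus N²/K behavior.

    def debug_time(n_statements, k_modules=1):
        bugs_per_module = n_statements / k_modules      # bugs ∝ module size
        time_per_bug = n_statements / k_modules         # scan one module of size N/K
        return k_modules * bugs_per_module * time_per_bug   # = N²/K

    print(debug_time(10_000))                  # monolithic: ∝ N² = 100,000,000
    print(debug_time(10_000, k_modules=100))   # 100 modules: ∝ N²/K = 1,000,000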
The feature of modularity that we are taking advantage of here is that it is easy to replace an inferior module with an improved one, thus allowing incremental improvement of a system without completely rebuilding it. Modularity thus helps control the complexity caused by change. This feature applies not only to debugging but to all aspects of system improvement and evolution. At the same time, it is important to recognize a design principle associated with modularity, which we may call
The Unyielding Foundations Rule
It is easier to change a module than to change the modularity.
The reason is that once an interface has been used by another module, changing the interface requires replacing at least two modules. If an interface is used by many modules, changing it requires replacing all of those modules simultaneously. For this reason, it is particularly important to get the modularity right.
Whole books have been written about modularity and the good things it brings. Sidebar 1.5 describes one of those books.
Sidebar 1.5 How Modularity Reshaped the Computer Industry
Two Harvard Business School professors, Carliss Baldwin and Kim Clark, have written a whole book about modularity.* It discusses many things, but one of the most interesting is its explanation of a major transition in the computer business. In the 1960s, computer systems were a vertically integrated industry. That is, IBM, Burroughs, Honeywell, and several others each provided top-to-bottom systems and support, offering processors, memory, storage, operating systems, applications, sales, and maintenance; IBM even manufactured its own chips. By the 1990s, the industry had transformed into a horizontally organized one in which Intel sells processors, Micron sells memory, Seagate sells disks, Microsoft sells operating systems, Adobe sells text and image applications, Oracle sells database systems, and Gateway and Dell assemble boxes called “computers” out of components provided by the other players.
Carliss Baldwin and Kim Clark explain this transition as an example of modularity in action. The companies that created vertically integrated product lines immediately found complexity running amok, and they concluded that the only effective way to control it was to modularize their products. After a few experiments with wrong modularities (IBM originally designed different computers for business and for scientific applications), they eventually hit on effective ways of splitting things up and thereby keeping their development costs and delivery schedules under control:
IBM developed the System/360 architecture specification, which could apply to machines of widely ranging performance. This modularity allowed any software to run on any size processor. IBM also developed a standard I/O bus and disk interface, so that any I/O device or disk manufactured by IBM could be attached to any IBM computer.
Digital Equipment Corporation developed the PDP–11 family, which, with improving technology, could simultaneously be driven down in price toward the PDP–11/03 and up in function toward the PDP–11/70. A hardware-assisted emulation strategy for missing hardware instructions on the smaller machines allowed applications written for any machine to run on any other machine in the family. Digital also developed an I/O architecture, the UNIBUS®, that allowed any I/O device to attach to any PDP–11 model.
The long-range result was that once this modularity was defined and proven to be effective, other vendors were able to jump in and turn each module into a distinct business. The result is the computer industry since the 1990s, which is remarkably horizontal, especially considering its rather different shape only 20 years earlier.
Carliss Baldwin and Kim Clark also observe, more generally, that a market economy is characterized by modularity. Rather than having a self-supporting farm family that does everything for itself, a market economy has coopers, tinkers, blacksmiths, stables, dressmakers, and so on, each being more productive in a modular specialty, all selling things to one another using a universal interface—money.
* Carliss Y. Baldwin and Kim B. Clark. Design Rules: The Power of Modularity [see Suggestions for Further Reading 1.3.7]. Warning: the authors use the word “modularity” to mean all of modularity, abstraction, layering, and hierarchy.
An important assumption in the numerical example of the effect of modularity on debugging time may not hold up in practice: that discovery of a bug should usually lead to examining just one module. For that assumption to hold true, there is a further requirement: there must be little or no propagation of effects from one module to another. Although there are lots of ways of dividing a system up into modules, some of these ways will prove to be better than others—“according to the natural formation, where the joint is, not breaking any part as a bad carver might” (Plato, Phaedrus 265e, Benjamin Jowett translation).
The purpose of computing is insight, not numbers.
— Richard W. Hamming, Numerical Methods for Scientists and Engineers (1962)
Thus the best divisions usually follow natural or effective boundaries. They are characterized by fewer interactions among modules and by less propagation of effects from one module to another. More generally, they are characterized by the ability of any module to treat all the others entirely on the basis of their external specifications, without need for knowledge about what goes on inside. This additional requirement on modularity is called abstraction. Abstraction is separation of interface from internals, of specification from implementation. Because abstraction nearly always accompanies modularity, some authors do not make any distinction between the two ideas. One sometimes sees the term functional modularity used to mean modularity with abstraction.
Thus one purchases a DVD player planning to view it as a device with a dozen or so buttons on the front panel and hoping never to look inside. If one had to know the details of the internal design of a television set in order to choose a compatible DVD player, no one would ever buy the player. Similarly, one turns a package over to an overnight delivery service without feeling a need to know anything about the particular kinds of vehicles or routes the service will use. Confidence that the package will be delivered tomorrow is the only concern.
In the computer world, abstraction appears in countless ways. The general ability of sequential circuits to remember state is abstracted into particular, easy-to-describe modules called registers. Programs are designed to hide details of their representation of complex data structures and details of which other programs they call. Users expect easy-to-use, button-pushing application interfaces such as computer games, spreadsheet programs, or Web browsers that abstract incredibly complex underpinnings of memory, processor, communication, and display management.
It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to manage than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institutions and merely lukewarm defenders in those who would gain by the new ones.
— Niccolò Machiavelli, The Prince (1513, published 1532; Tr. by Thomas G. Bergin, Appleton-Century-Crofts, 1947)
The goal of minimizing interconnections among modules may be defeated if unintentional or accidental interconnections occur as a result of implementation errors or even well-meaning design attempts to sneak past modular boundaries in order to improve performance or meet some other requirement. Software is particularly subject to this problem because the modular boundaries provided by separately compiled subprograms are somewhat soft and easily penetrated by errors in using pointers, filling buffers, or calculating array indices. For this reason, system designers prefer techniques that enforce modularity by interposing impenetrable walls between modules. These techniques ensure that there can be no unintentional or hidden interconnections. Chapters 4 and 5 develop some of these techniques for enforcing modularity.
Well-designed and properly enforced modular abstractions are especially important in limiting the impact of faults because they control propagation of effects. As we shall see when we study fault tolerance in Chapter 8 [on-line], modules are the units of fault containment, and the definition of a failure is that a module does not meet its abstract interface specifications.
Closely related to abstraction is an important design rule that makes modularity work in practice:
The Robustness Principle
Be tolerant of inputs and strict on outputs.
This principle means that a module should be designed to be liberal in its interpretation of its input values, accepting them even if they are not within specified ranges, if it is still apparent how to sensibly interpret them. On the other hand, the module should construct its outputs conservatively in accordance with its specification—if possible making them even more accurate or more constrained than the specification requires. The effect of the robustness principle is to tend to suppress, rather than propagate or even amplify, noise or errors that show up in the interfaces between modules.
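As a concrete, if artificial, illustration, here is a small Python sketch of a module boundary that follows this principle; the set of accepted spellings is an assumption invented for the example.

    # A sketch of the robustness principle for a hypothetical module that
    # exchanges boolean settings with its neighbors: interpret inputs
    # liberally, but always emit one of exactly two canonical outputs.

    TRUTHY = {"1", "y", "yes", "true", "on"}
    FALSY = {"0", "n", "no", "false", "off"}

    def parse_flag(text: str) -> bool:
        """Tolerant of inputs: accept case and whitespace variations."""
        token = text.strip().lower()
        if token in TRUTHY:
            return True
        if token in FALSY:
            return False
        raise ValueError(f"cannot sensibly interpret {text!r}")

    def emit_flag(value: bool) -> str:
        """Strict on outputs: always the one canonical spelling."""
        return "true" if value else "false"

    print(emit_flag(parse_flag("  YES ")))   # -> "true"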
The robustness principle is one of the key ideas underlying modern mass production. Historically, machinists made components that were intended to mate by machining one of the components and then machining a second component to exactly fit against or into the first one, a technique known as fitting. The breakthrough came with the realization that if one specified tolerances for components and designed each component to mate with any other component that was within its specified tolerance, then it would be possible to modularize and speed up manufacturing by having interchangeable parts. Apparently, this concept was first successfully applied in an 1822 contract to deliver rifles to the United States Army. By the time production lines for the Model T automobile were created, Henry Ford captured the concept in the aphorism, “In mass production there are no fitters.”
We are faced with an insurmountable opportunity.
— Pogo (Walt Kelley)
The robustness principle plays a major role in computer systems. It is particularly important in human interfaces, network protocols, and fault tolerance, and, as Section 1.4 of this chapter explains, it forms the basis for digital logic. At the same time, a tension exists between the robustness principle and another important design principle:
The Safety Margin Principle
Keep track of the distance to the cliff, or you may fall over the edge.
When inputs are not close to their specified values, that is usually an indication that something is starting to go wrong. The sooner that something going wrong can be noticed, the sooner it can be fixed. For this reason, it is important to track and report out-of-tolerance inputs, even if the robustness principle would allow them to be interpreted successfully.
Some systems implement the safety margin principle by providing two modes of operation, which might be called “shake-out” and “production”. In shake-out mode, modules check every input carefully and refuse to accept anything that is even slightly out of specification, thus allowing immediate discovery of problems and of programming errors near their source. In production mode, modules accept any input that they can reasonably interpret, in accordance with the robustness principle. Carefully designed systems blend the two ideas: accept any reasonable input but report any input that is beginning to drift out of tolerance so that it may be repaired before it becomes completely unusable.
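A sketch of how these two modes might look in code follows; the mode names, the ranges, and the choice to clamp out-of-spec values are all assumptions made for illustration.

    # A sketch of the shake-out/production distinction described above.

    import warnings

    SPEC_RANGE = (0.0, 1.0)     # the specified input range (assumed)
    HARD_RANGE = (-0.1, 1.1)    # still sensibly interpretable (assumed)

    def accept(value: float, mode: str = "production") -> float:
        lo, hi = SPEC_RANGE
        if lo <= value <= hi:
            return value                       # comfortably within spec
        if mode == "shake-out":
            raise ValueError(f"{value} outside spec {SPEC_RANGE}")
        if HARD_RANGE[0] <= value <= HARD_RANGE[1]:
            warnings.warn(f"input {value} is drifting out of tolerance")
            return min(max(value, lo), hi)     # interpret it anyway, per robustness
        raise ValueError(f"{value} cannot be sensibly interpreted")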
Systems that are designed using good abstractions tend to minimize the number of interconnections among their component modules. One powerful way to reduce module interconnections is to employ a particular method of module organization known as layering. In designing with layers, one builds on a set of mechanisms that is already complete (a lower layer) and uses them to create a different complete set of mechanisms (an upper layer). A layer may itself be implemented as several modules, but as a general rule, a module of a given layer interacts only with its peers in the same layer and with the modules of the next higher and next lower layers. That restriction can significantly reduce the number of potential intermodule interactions in a big system.
There is no such thing as a small change to a large system.
— systems folklore, source lost in the mists of time
Some of the best examples of this approach are found in computer systems: an interpreter for a high-level language is implemented using a lower-level, more machine-oriented, language. Although the higher-level language doesn’t allow any new programs to be expressed, it is easier to use, at least for the application for which it was designed.
Thus, nearly every computer system comprises several layers. The lowest layer consists of gates and memory cells, upon which is built a layer consisting of a processor and memory. On top of this layer is built an operating system layer, which acts as an augmentation of the processor and memory layer. Finally, an application program executes on this augmented processor and memory layer. In each layer, the functions provided by the layer below are rearranged, repackaged, reabstracted, and reinterpreted as appropriate for the convenience of the layer above. As will be seen in Chapter 7 [on-line], layers are also the primary organizing technique of data communication networks.
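The discipline can be made concrete with a toy Python sketch (the two layers and their interfaces here are invented): the upper layer is written entirely in terms of the interface of the layer below, and interacts with no other layer.

    # A toy sketch of layering: a name-based "file system" layer built
    # only on the interface of a raw block-storage layer beneath it.

    class BlockDevice:                    # lower layer: numbered raw blocks
        def __init__(self, n_blocks):
            self.blocks = [b""] * n_blocks
        def read(self, i): return self.blocks[i]
        def write(self, i, data): self.blocks[i] = data

    class FileSystem:                     # upper layer: names on top of blocks
        def __init__(self, device: BlockDevice):
            self.device, self.table, self.next_free = device, {}, 0
        def create(self, name, data):
            self.device.write(self.next_free, data)   # uses only the layer below
            self.table[name] = self.next_free
            self.next_free += 1
        def open(self, name):
            return self.device.read(self.table[name])

    fs = FileSystem(BlockDevice(16))
    fs.create("notes.txt", b"layers reduce interconnections")
    print(fs.open("notes.txt"))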
Layered design is not unique to computer systems and communications. A house has an inner structural layer of studs, joists, and rafters to provide shape and strength, a layer of sheathing and drywall to keep the wind out, a layer of siding, flooring and roof tiles to make it watertight, and a cosmetic layer of paint to make it look good. Much of mathematics, particularly algebra, is elegantly organized in layers (in the case of algebra, integers, rationals, complex numbers, polynomials, and polynomials with polynomial coefficients), and that organization provides a key to deep understanding.
The final major technique for coping with complexity also reduces interconnections among modules but in a different, specialized way. Start with a small group of modules, and assemble them into a stable, self-contained subsystem that has a well-defined interface. Next, assemble a small group of subsystems to produce a larger subsystem. This process continues until the final system has been constructed from a small number of relatively large subsystems. The result is a tree-like structure known as a hierarchy. Large organizations such as corporations are nearly always set up this way, with a manager responsible for only five to ten employees, a higher-level manager responsible for five to ten managers, on up to the president of the company, who may supervise five to ten vice presidents. The same thinking applies to armies. Even layers can be thought of as a kind of degenerate one-dimensional hierarchy.
The first 80 percent of a project takes 80 percent of the effort.
The last 20 percent takes another 80.
— source unknown
There are many other striking examples of hierarchy, ranging from microscopic biological systems to the assembly of Alexander’s empire. A classic paper by Herbert Simon, “The architecture of complexity” [Suggestions for Further Reading 1.4.3], contains an amazing range of such examples and offers compelling arguments that, under evolution, hierarchical designs have a better chance of survival. The reason is that hierarchy constrains interactions by permitting them only among the components of a subsystem. Hierarchy constrains a system of N components, which in the worst case might exhibit N × (N − 1) interactions, so that each component can interact only with members of its own subsystem, except for an interface component that also interacts with other members of the subsystem at the next higher level of hierarchy. (The interface component in a corporation is called a “manager”; in an army, it is called the “commanding officer”; for a program, it is called the “application programming interface”.) If subsystems have a limit of, say, 10 components, this number remains constant no matter how large the system grows. There will be N/10 lowest-level subsystems, N/100 next-higher-level subsystems, and so on, but the total number of subsystems, and thus the number of interactions, remains proportional to N. Analogous to the way that modularity reduces the effort of debugging, hierarchy reduces the number of potential interactions among modules from square-law to linear.
This effect is most strongly noticed by the designer of an individual module. If there are no constraints, each module should in principle be prepared to interact with every other module of the system. The advantage of a hierarchy is that the module designer can focus just on interactions with the interfaces of other members of its immediate subsystem.
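The arithmetic behind this claim is easy to check. The short Python sketch below compares the worst case, N × (N − 1), with a hierarchy whose subsystems are limited to 10 components, as in the example above; the counting is rough (it ignores remainders when N is not a power of 10), but the square-law versus linear contrast is unmistakable.

    def worst_case(n):
        return n * (n - 1)                 # every component may talk to every other

    def hierarchical(n, group=10):
        total = 0
        while n > 1:
            total += n * (group - 1)       # each component talks only to its ≤ 9 peers
            n = n // group                 # one interface component per subsystem
        return total

    print(worst_case(100_000))             # 9,999,900,000: about 10^10 interactions
    print(hierarchical(100_000))           # 999,990: about 10N, proportional to N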
The four techniques for coping with complexity—modularity, abstraction, layering, and hierarchy—provide ways of dividing things up and placing the resulting modules in suitable relation one to another. However, we still need a way of connecting those modules. In digital systems, the primary connection method is that one module names another module that it intends to use. Names allow postponing of decisions, easy replacement of one module with a better one, and sharing of modules. Software uses names in an obvious way. Less obviously, hardware modules connected to a bus also use names for interconnection—addresses, including bus addresses, are a kind of name.
Hofstadter’s Law: It always takes longer than you expect, even when you take into account Hofstadter’s Law.
— Douglas Hofstadter: Gödel, Escher, Bach: An Eternal Golden Braid (1979)
In a modular system, one can usually find several ways to combine modules to implement a desired feature. The designer must at some point choose a specific implementation from among many that are available. Making this choice is called binding. Recalling that the power of modularity comes from the ability to replace an implementation with a better one, the designer usually tries to maintain maximum flexibility by delaying binding until the last possible instant, perhaps even until the first instant that the feature is actually needed.
One way to delay binding is just to name a feature rather than implementing it. Using a name allows one to design a module as if a feature of another module exists, even if that feature has not yet been implemented, and it also makes it mechanically easy to later choose a different implementation. By the time the feature is actually invoked, the name must, of course, be bound to a real implementation of the other module. Using a name to delay or allow changing a binding is called indirection, and it is the basis of a design principle:
Decouple Modules with Indirection
Indirection supports replaceability.
A folk wisdom version of this principle, attributed to computer scientist David Wheeler of the University of Cambridge, exaggerates the power of indirection by suggesting that “any problem in a computer system can be solved by adding a layer of indirection.” A somewhat more plausible counterpart of this folk wisdom is the observation that any computer system can be made faster by removing a layer of indirection.
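Here is a minimal Python sketch of decoupling by indirection; the registry, the name "compress", and the use of zlib as the eventual implementation are illustrative choices, not a prescription.

    # Callers bind to the *name* "compress", not to a particular
    # implementation, so the implementation can be replaced later
    # without touching any caller.

    import zlib

    registry = {}                          # name -> implementation

    def bind(name, implementation):
        registry[name] = implementation    # (re)binding is one assignment

    def call(name, *args):
        # the lookup through the name is the "layer of indirection"
        return registry[name](*args)

    bind("compress", lambda data: data)        # placeholder: feature not yet built
    bind("compress", zlib.compress)            # later: bind a real implementation
    print(len(call("compress", b"x" * 1000)))  # callers are unchanged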
When a module has a name, several other modules can make use of it by name, thereby sharing the design effort, cost, or information contained in the first module. Because names are a cornerstone element of modularity in digital systems, Chapters 2 and 3 are largely about the design of naming schemes.
As we have repeatedly suggested, there is an important lesson to be drawn from the wide range of examples used up to this point to illustrate system problems. Certain common problems show up in all complex systems, whatever their field. Emergent properties, propagation of effects, incommensurate scaling, and trade-offs are considerations in activities as diverse as space station design, management of the economy, the building of skyscrapers, gene-splicing, petroleum refineries, communication satellite networks, and the governing of India, as well as in the design of computer systems. Furthermore, the techniques that have been devised for coping with complexity are universal. Modularity, abstraction, layering, and hierarchy are used as tools in most fields that deal with complex systems. It is therefore useful for the computer system designer to investigate systems from other fields, both to gain additional perspective on how system problems arise and to discover specific techniques from other fields that may also apply to computer systems. Stated briefly, we conclude that computer systems are the same as all other systems.
A system is never finished being developed until it ceases to be used.
— attributed to Gerald M. Weinberg
But there is one problem with that conclusion: it is wrong. There are at least two significant ways in which computer systems differ from every other kind of system with which designers have experience:
The complexity of a computer system is not limited by physical laws.
The rate of change of computer system technology is unprecedented.
These two differences have an enormous impact on complexity and on ways of coping with it.
Computer systems are mostly digital, and they are controlled by software. Each of these two properties separately leads to relaxations of what, in other systems, would be limits on complexity arising from physical laws.
Consider first the difference between analog and digital systems. All analog systems have the engineering limitation that each component of the system contributes noise. This noise may come from the environment in the form of, for example, vibration or electromagnetic radiation. Noise may also appear because the component’s physical behavior does not precisely follow any tractable model of operation: the pile of rocks that a civil engineer specifies to go under a bridge abutment does not obey a simple deformation model; a resistor in an electronic circuit generates random noise whose level depends on the temperature. When analog components are composed into systems, the noise from individual components accumulates (if the noise sources are statistically independent, the noise may accumulate only slowly but it still accumulates). As the number of components increases, noise will at some point dominate the behavior of the system. (This analysis applies to systems designed by human engineers. Natural biological, thermodynamic, and macroeconomic systems, composed of billions of analog components, somehow use hierarchy, layering, abstraction, and modularity to operate despite noise, but they are so complex that we do not understand them well enough to adopt the same techniques.)
I was to learn later in life that we tend to meet any new situation by reorganisation; and what a wonderful method it can be for creating the illusion of progress while producing confusion, inefficiency and demoralisation.
— shortened version of an observation by Charlton Ogburn, “Merrill’s Marauders: The truth about an incredible adventure”, Harper’s Magazine (January 1957). Widely but improbably misattributed to Petronius Arbiter (ca. A.D. 60)
Noise thus provides a limit on the number of analog components that a designer can usefully compose or on the number of stages that a designer can usefully cascade. This argument applies to any engineered analog system: a bridge across a river, a stereo, or an airliner. It is the reason a photocopy of a photocopy is harder to read than the original. There may also be other limits on size (arising from the strength of materials, for example), but noise is always a limit on the complexity of analog systems.
In contrast, digital systems are noise-free; complexity can therefore grow without any bound arising from noise. The designers of digital logic use a version of the robustness principle known as the static discipline. This discipline is the primary source of the magic that seems to surround digital systems. The static discipline requires that the range of analog values that a device accepts as meaning the digital value ONE (or ZERO) be wider than the range of analog values that the device puts out when it means digital ONE (or ZERO). This discipline is an example of being tolerant of inputs and strict on outputs.
Digital systems are, at some lower level, constructed of analog components. The analog components chosen for this purpose are non-linear, and they have gain between input and output. When used appropriately, non-linearity allows inputs to have a wide tolerance, and gain ensures that outputs stay within narrow specifications, as shown in Figure 1.3. Together they produce the property of digital circuits called level restoration or regeneration. Regenerated signal levels appear at the output of every digital component, whatever their level of granularity: a gate, a flip-flop, a memory chip, a processor, or a complete computer system. Regenerated levels create clean interfaces that allow one subsystem to be connected to the next with confidence. Unlike the civil engineer’s pile of rocks, a logic gate performs exactly as its designer intends.
Figure 1.3 How gain and non-linearity of a digital component restore levels. The input level and output level span the same range of values, but the range of accepted inputs is much wider than the range of generated outputs.
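A small Python sketch can mimic the static discipline; the voltage thresholds and output levels here are invented, but they have the required shape: the accepted input ranges are much wider than the generated output levels, so each stage restores the signal.

    # A sketch of level restoration in a hypothetical logic buffer.
    # Inputs ≤ 0.8 V mean ZERO and inputs ≥ 2.0 V mean ONE (tolerant);
    # outputs are always 0.1 V or 4.9 V (strict).

    def restoring_buffer(volts: float) -> float:
        if volts <= 0.8:        # wide input range accepted as ZERO
            return 0.1          # narrow, regenerated output for ZERO
        if volts >= 2.0:        # wide input range accepted as ONE
            return 4.9          # narrow, regenerated output for ONE
        raise ValueError("forbidden region: neither ZERO nor ONE")

    signal = 4.9
    for _ in range(1000):       # noise injected at each stage does not accumulate
        signal = restoring_buffer(signal) - 0.3
    print(signal)               # still ≈ 4.6: unambiguously a ONE after 1000 stages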
The probability of failure of a system tends to be proportional to the confidence that its designer has in its reliability.
— systems folklore, source lost
The static discipline and level restoration do not guarantee that devices with digital inputs and outputs never make mistakes. Any component can fail. Or an input signal that is intended to be a ONE may be so far out of tolerance that the receiving component accepts it as a ZERO. When that happens, the output of the component that accepted that value incorrectly is likely to be wrong, too. The important consequence is that digital components make big mistakes, not little ones, and as we shall see when we reach the chapter on fault tolerance, big mistakes are relatively easy to detect and handle.
If a signal does not accumulate noise as it goes through a string of devices, then noise does not limit the number of devices one can string together. In other words, noise does not constrain the maximum depth of composition for digital systems. Unlike analog systems, digital systems can grow in complexity until they exceed the ability of their designers to understand them. As of 2009, processor chips contain over two billion transistors, far more than any analog chip. No airliner has nearly that many components—except in its on-board computers.
The second reason composition has no nearby bounds is that computer systems are controlled by software. Bad as the contribution to complexity from the static discipline may be, the contribution from software turns out to be worse. Hardware is at least subject to some physical limits—the speed of light, the rate of settling of signals in real semiconductor materials, unwanted electrical coupling between adjacent components, the rate at which heat can be removed, and the space that it occupies. Software appears to have no physical limits whatever beyond the availability of memory to store it and processors to execute it. As a result, composition of software can go on as fast as people can create it. Thus one routinely hears of operating systems, database systems, and even word processors consisting of more than 10 million program statements.
In principle, abstraction can help control software composition by hiding implementation beneath module interfaces. The problem is that most abstractions are, in reality, slightly “leaky” in that they don’t perfectly conceal the underlying implementation. A simple example of leakiness is addition of integers: in most implementations, the addition operation perfectly matches the mathematical specification as long as the result fits in the available word size, but if the result is larger than that, the resulting overflow becomes a complication for the programmer. Leakiness, like noise in analog systems, accumulates as the number of software modules grows. Unlike noise, it accumulates in the form of complexity, so the lack of physical constraints on software composition remains a fundamental problem. It is, therefore, mechanically easy to create a system with complexity that is far beyond the ability of its designers to understand. And since it is easy, it happens often, and sometimes with disastrous results.*
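The leak is easy to demonstrate. The Python sketch below simulates 32-bit two's-complement addition (Python's own integers do not overflow, so the fixed word size is imposed here by masking):

    # The integer-addition abstraction leaks at the word-size boundary.

    def add32(a: int, b: int) -> int:
        """32-bit two's-complement addition: the word size is the leak."""
        s = (a + b) & 0xFFFFFFFF
        return s - 2**32 if s >= 2**31 else s

    print(add32(1_000_000, 2_000_000))   # 3000000: matches the mathematical spec
    print(add32(2_147_483_647, 1))       # -2147483648: the abstraction leaks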
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.
— Douglas Adams, Mostly Harmless (Hitchhiker’s Guide to the Galaxy V) (1993)
Between the absence of a noise-imposed limit on composition of digital hardware and very distant physical limits on composition of software, it is too easy for an unwary designer to misuse the tools of modularity, abstraction, layering, and hierarchy to include still more complexity. This phenomenon is quite unknown in the design of bridges and airliners. In contrast with other systems, computer systems allow composition to a depth whose first limit is the designer’s ability to understand. Unfortunately, this lack of nearby natural, physical bounds on depth of composition tempts designers to build more complex systems. If nature does not impose a nearby limit on composition, the designer must self-impose a limit. Since it can be hard to say no to a reasonable-sounding feature, features keep getting added. Therein lies the fate of too many computer system designs.
For reasons partly explained by Sidebar 1.6, during the last 35 years the cost of the digital hardware used for computation and communication has dropped an average of about 30% each year. This rate of change means that just two years’ passage of time has been enough to allow technology to cut prices in half, and in seven or eight years it has led to a drop in prices by a factor of 10. Some components have experienced even greater rates of improvement. Figure 1.4 shows the cost of magnetic disk storage over a 25-year span. During that time, disk prices have actually dropped by a factor of 10 roughly every five years, which works out to a drop of nearly 40% each year. Disk experts project a similar rate of improvement for at least another few years. Their projection seems relatively safe, since no major roadblocks have been reported by development laboratories that are already working on the next rounds of magnetic recording technology. Similar charts apply to random access memory, processor cost, and the speed of optical fiber transmission.
Figure 1.4 Magnetic disk price history and projection, 1983–2007.
Sidebar 1.6 Why Computer Technology has Improved Exponentially with Time
Popular media frequently use the term “exponential” to describe the explosive rate of improvement of computer technology. Stephen Ward has pointed out that there is a good reason this adjective is appropriate: computer technology appears to be the rare engineering discipline in which the technology being improved is routinely employed to improve the technology. People building airplanes, bridges, skyscrapers, and chemical plants rarely, if ever, have this opportunity.
For example, the performance of a microprocessor is determined at least in part by the cleverness of its layout, which in turn is limited by the time available to use computer-assisted layout tools that can take advantage of lithography advances. If Intel, through improved layout, makes a version of the Pentium that is twice as fast, as soon as that new Pentium is available, it will be used as the processor to make the layout tools for the next Pentium run twice as fast; the next design can benefit from twice as much computation in its layout. This effect is probably one of the drivers of Moore’s law, which predicts an exponential increase in component count on chips with a doubling time of 18 months [Suggestions for Further Reading 1.6.1].
If indeed the rate at which we can improve our technology is proportional to the quality of the technology itself, we can express this idea as

    d(technology)/dt = c × technology

which has an exponential solution,

    technology(t) = technology(0) × e^(c × t)
The actual situation is, of course, far more complicated than that equation suggests, but all equations that even remotely resemble that form, in which technology’s rate of growth is some positive function of its current state, have growing exponentials in their solution.
In the real world, exponentials must eventually hit some limit. In hardware there are fairly clear fundamental physical limits to exponential growth, such as the uncertainty principle, the minimum energy required to switch a gate, and the rate at which heat can be removed from a device. The interesting part is that it isn’t obvious which one is going to become the roadblock, or when. Thus far, engineering ingenuity in exploiting trade-offs has postponed the day of reckoning. For software, similar limits on exponential growth must exist, but their nature is not at all clear.
More to the immediate point, virtually every improvement in computer and communications technology—whether faster chips, better Internet routing algorithms, more effective prototyping languages, better browser interfaces, faster compilers, bigger disks, or larger RAM—is immediately put to work by everyone who is working on faster chips, better Internet routing algorithms, more effective prototyping languages, better browser interfaces, faster compilers, bigger disks, or larger RAM. Computer system designers live inside a giant feedback system that, at least for the moment, is enjoying exponential solutions.
This rapid change of technology has created a substantial difference between computer systems and other engineering systems. Since complex systems can take several years to build, by the time a computer system is ready for delivery, the ground rules under which it was originally designed have shifted. Incommensurate scaling typically means that the designer must adjust for strains when any system parameter changes by a factor of 2, because not all of the components scale up (or down) by the same proportion. More to the point, a whole new design is usually needed when any system parameter changes by a (decimal) order of magnitude. This rule of thumb about strains caused by parameter changes gives us our next design principle:
Structural engineering is the art of modeling materials we do not wholly understand, into shapes we cannot precisely analyse so as to withstand forces we cannot properly assess, in such a way that the public has no reason to suspect the extent of our ignorance.
— A. R. Dykes, Scottish Branch, Institution of Structural Engineers (1946)
The Incommensurate Scaling Rule
Changing any system parameter by a factor of 10 usually requires a new design.
If you design it so that it can be assembled wrong, someone will assemble it wrong.
— Edward A. Murphy, Jr. (a paraphrase of the original version of Murphy’s law, 1949; see Sidebar 2.5)
This rule, when combined with the observed rate of change of technology, means that by the time a newly designed computer system is ready for delivery it may already have needed two rounds of adjustment and be ready for a complete redesign. Even if the designer has tried to predict the impact of technology change, crystal balls are at best cloudy. Worse, during the development of the system, things may run an order of magnitude slower than they will when the system is finished, the code and data may not fit in the available address space, or the data may have to be partitioned across several hard disks instead of fitting nicely on one. One can compensate for each of these problems, but each such compensation absorbs intellectual resources and contributes complexity to the development process.
Even without those adjustments or redesign, the original plan was probably already a new design. A bridge (or airplane) may have a modest number of things that are different from the previous one, but a civil (or aeronautical) engineer almost always ends up designing something that is only a little different from some previous bridge (or airplane). In the case of computer systems, ideas that were completely unrealistic a year or two ago can become mainstream in no time, so the computer system designer almost always ends up designing something that is significantly different from the previous computer system. This difference makes deep analysis of previous designs more rewarding for civil and aeronautical engineers than for computer system designers, and also usually means that in computer systems there hasn’t been time to discover and iron out most of the mistakes of the previous design before going on to the next major revision. Those mistakes can contribute strongly to complexity.
Because technology has improved so rapidly, the field of computer system design tends to place much less emphasis on detailed performance analysis and fine-tuning than do most other engineering endeavors. Where an electric power generation system may benefit dramatically from a new steam turbine that improves energy transfer by 1%, a needed 20% improvement in performance of a computer system can usually be obtained just by waiting four months for the next round of hardware product announcements. If a proposal to rewrite an application to obtain that same improvement would require a year of work, it is probably more cost-effective to just wait for technology change to solve the problem. Put another way, rapidly improving technology means that brute-force solutions (buy more memory, wait for a faster processor, use a simpler algorithm) are often the right approach in computer systems, whereas in other systems they may be unthinkable. The owner of the railroad through the canyon probably would not view as economically reasonable a proposal to blast the canyon wider and install a second track. Even if the resources were available, the environmental impact would be a deterrent.
This “telephone” has too many shortcomings to be seriously considered as a means of communication. The device is inherently of no value to us.
— frequently attributed to an 1876 Western Union internal memo, but there is no evidence of this memo and it is probably a myth.
A second major consequence of the rapid rate of change of technology in computer systems is that the usability of computer systems, along with related qualities that go under the label “human engineering”, is always ragged. It takes years of trial and error to make systems usable, friendly, and forgiving, but by the time one level of computer technology has been tamed, a new level of computer technology opens the possibilities of many new features at the same cost, or of providing the previous features more cheaply to a vast new audience of unprepared users.
Similarly, legal and judicial processes take decades to come to grips with new issues, as people debate the wisdom of various policies, discover abuses, and explore alternative remedies. In the face of rapidly changing computer system technology, these processes fall far behind, delaying resolution of such concerns as how to reward innovative software ideas, or what rules should protect information stored in computers, and adding uncertainty of requirements to the burden of the computer system designer.*
Finally, modern high-speed communications with global reach have greatly accelerated the rate at which people discover that a new technology is useful and adopt it. Where it took several decades for electricity and the telephone to move from curiosities to widespread use, recent innovations such as digital cameras and DVDs have swept their markets in less than a decade, and a single mention of a previously obscure World Wide Web site on CNN or in Newsweek magazine can cause that site to be suddenly overwhelmed with millions of hits per day. More generally, newly viable applications, such as peer-to-peer file sharing, can change the shape of the workload on existing systems practically overnight.
Thus, the study of computer systems involves telescoping of the usual processes of planning, examining requirements, tailoring details, and integrating with users and society. This telescoping leads to the delivery of systems that have rough edges and that lack the benefit of the cleverest thought. People who build airplanes and bridges do not have to face these problems. Such problems can be viewed either as a frustrating difficulty or as an exciting challenge, depending on one’s perspective.
Modest physical limits in hardware and very distant physical limits in software together give us the opportunity to create systems of unimaginable—and unmanageable—complexity, and the rapid pace of technology change tempts designers to deliver systems using new and untested ground rules. These two effects amplify the complexity of computer systems when compared with systems from other engineering areas. Thus, computer system designers need some additional tools to cope with complexity.
Books will soon be obsolete in the public schools. … It is possible to teach every branch of human knowledge with the motion picture. Our school system will be completely changed inside of ten years.
— Thomas Edison, quoted in the New York Dramatic Mirror (July 9, 1913)
Modularity, abstraction, layering, and hierarchy are a major help, but by themselves they aren’t enough to keep the resulting complexity under control. The reason is that all four of those techniques assume that the designer understands the system being designed. In the real, fast-changing world of computer systems, it is hard to choose
the right modularity from a sea of plausible alternative modularities.
the right abstraction from a sea of plausible alternative abstractions.
the right layering from a sea of plausible alternative layerings.
the right hierarchy from a sea of plausible alternative hierarchies.
Although some design principles are available, they are far too few, and the only real guidance comes from experience with previous systems.
As might be expected, designers of computer systems have developed and refined at least one additional technique to cope with complexity. Designers of other kinds of systems use this technique as well, but they usually do not consider it to be so fundamental to success as it is for computer systems, probably because the technique is particularly feasible with software. It is a development process called iteration.
The essence of iteration is to start by building a simple, working system that meets only a modest subset of the requirements and then evolve that system in small steps to gradually encompass more and more of the full set of requirements. The idea is that small steps can help reduce the risk that complexity will overwhelm a system design. Having a working system available at all times helps provide assurance that something can be built and provides on-going experience with the current technology ground rules as well as an opportunity to discover and fix bugs. Finally, adjustments for technology changes that arrive during the system development are easier to incorporate as part of one or more of the iterations. When you see a piece of software identified as “release 5.4”, that is usually an indication that the vendor is using iteration.
Successful iteration requires considerable foresight. That foresight involves several elements, two of which we identify as design principles:
I think there is a world market for maybe five computers.
— Often claimed to be from a 1943 talk by Thomas J. Watson, Sr., chairman of IBM, but there is little evidence that he said it; the quotation is probably a legend.
Design for Iteration
You won’t get it right the first time, so make it easy to change.
Document the assumptions behind the design so that when the time comes to change the design you can more easily figure out what else has to change. Expect not only to modify and replace modules, but also to remodularize as the system and its requirements become better understood.
Take small steps. The purpose is to allow discovery of both design mistakes and bad ideas quickly, so that they can be changed or removed with little effort, before other parts of the system built in later iterations start to depend on them and they effectively become unchangeable. Systems under active development may be subjected to a complete system rebuild every day because the rebuilding process invokes a large number of checks and tests that can reveal implementation mistakes while the changes that caused the mistakes are still fresh in the minds of the implementers.
Don’t rush. Even though individual steps may be small, they must still be well planned. In most projects, the temptation is to rush to implementation. With iterative design, that temptation can be stronger, and the designer must make sure that the design is ready for the next step.
Plan for feedback. Include as part of the design both feedback paths and positive incentives to provide feedback. Testers, installers, maintainers, and users of the system can provide much of the information needed to refine it. Alpha testing (“we’re not at all sure this even works”) and beta testing (“seems to work, use at your own risk”) are common examples, and many vendors encourage users to report details of problems and transcripts of failures by e-mail. A well-designed system will provide many such feedback schemes at all levels.
Study failures. An important goal is to learn from failures rather than assign blame for them. Incentives must be carefully designed to ensure that feedback about failures is not ignored or even suppressed by people fearful of being blamed. Then, having found the apparent cause of a failure,
Keep Digging
Complex systems fail for complex reasons.
Computers in the future may weigh no more than 1.5 tons.
— Popular Mechanics (March 1949)
Continue looking for other contributing or more basic causes. Working systems often work for reasons that aren’t well understood. It is common to find that a new release of a system reveals a bug that has actually been in the system for a long time but has never mattered until now. Much can be learned by figuring out why it never mattered. It can also be useful to explore the mindset of the designers to understand what allowed them to design a system that could fail in this way.* Similarly, don’t ignore unexplained behavior. If the feedback reports something that now seems not to be a problem or to have gone away, it is probably a sign that something is wrong rather than that the system magically fixed itself.
Iteration sounds like a straightforward technique, but several obstacles tend to interfere with it. The main obstacle is that as a design evolves through a series of iterations, a risk of losing conceptual integrity arises. That risk suggests that the overall plan for the initial, simplest version of the system must accommodate all of the iterations needed to reach the final version (thus the need for foresight). Someone must constantly be on guard to make sure that the overall design rationale remains clear despite changes made during iteration.
In most organizations, good news (e.g., a major piece of the system is working ahead of schedule) flows rapidly throughout the organization, but bad news (e.g., an important module isn’t working yet) often gets confined to the part of the organization that discovers it, at least until it can fix the problem and report good news. This phenomenon, the bad-news diode, can prevent realization that changing a different part of the system is more appropriate.
A related problem is that when someone finally realizes that the modularity is wrong, it can be hard to change, for two reasons. First, the unyielding foundations rule (see page 20) comes into play. Changing modularity by definition involves changing more than one module, and sometimes several. Second, designers who have invested time and effort in developing a module that, from their point of view, is doing what was intended can be reluctant to see this time and effort lost in a rework. Simply put, to change modularity one must deal with both committed components and committed designers.
Based on extensive financial and market analysis, it’s projected that no more than five thousand of the new Haloid machines will sell. … Model 914 has no future in the office copying market.
— Report to IBM by the consulting firm Arthur D. Little on the prospects of the xerographic copier (1959)
A longer-term risk of iteration sometimes shows up when the initial design is both simple and successful. Success can lead designers to be overconfident and to be too ambitious on a later iteration. Technology has improved in the time since deployment of the initial version of the system and feedback has suggested lots of new features. Each suggested feature looks straightforward by itself, and it is difficult to judge how they might interact. The result is often a disastrous overreaching and consequent failure that is so common that it has a name: the second-system effect.
Iteration can be thought of as applying modularity to the management of the system design and implementation process. It thus takes us into the realm of management techniques, which are not directly addressed in this book.*
Remarkably, one of the most effective techniques in coping with complexity is also one that is most difficult to apply: simplicity. As Section 1.4.1 explained, computer systems lack natural physical limits to curb their complexity, so the designer must impose limits; otherwise the designer risks being overwhelmed.
The problem with the apparently obvious advice to keep it simple is that
previous systems give a taste of how great things could be if more features were added.
the technology has improved so much that cost and performance are not constraints.
each of the suggested new features has been successfully demonstrated somewhere.
none of the exceptions or other complications seems by itself to be especially hard to deal with.
there is fear that a competitor will market a system that has even more features.
among system designers, arrogance, pride, and overconfidence are more common than clear awareness of the dangers of complexity.
These considerations make it hard to say “no” to any one requirement, feature, exception, or complication. It is their cumulative impact that produces the complexity explosion illustrated in Figure 1.1. The system designer must keep this cumulative impact in mind at all times. The bottom line is that a computer system designer’s most potent weapon against complexity is the ability to say, “No. This will make it too complicated.”
There is no reason anyone would want a computer in their home.
— Kenneth Olsen, president of Digital Equipment Corporation (1977)
As we proceed to study specific computer system engineering topics, we shall make much use of a particular kind of simplicity, to the extent that it is yet another design principle:
Adopt Sweeping Simplifications
So you can see what you are doing.
Each topic area will explicitly introduce one or more sweeping simplifications. The reason is that they allow the designer to make compelling arguments for correctness, they make detail irrelevant, and they make clear to all participants exactly what is going on. They will turn out to be one of our best hopes for keeping control of complexity.
This chapter has introduced some basic ideas that underlie the study of computer systems. In the course of building on these basic ideas, the ensuing chapters explore a series of system engineering topics in the light of three recurring themes:
Modularity appears in each engineering topic either as one of the goals of that topic or as one of its design cornerstones. Words from chapter titles suggest this theme. Abstractions and layering are particular ways to build on modularity. Naming is a fundamental mechanism for interconnecting and replacing modules. Clients and services and virtualization are two ways of enforcing modularity. Networks are built on a foundation of modularity. In fault tolerance, the module is the unit that limits the extent of failure. Atomicity is an exceptionally robust form of modularity that the designer can exploit to obtain consistency. Finally, protection of information involves further strengthening of modular walls.
The second theme, principle-based system design, has already emerged, both in explicit mention of several principles and in the list of design principles on the inside front cover. These principles capture, in brief phrases, widely applicable nuggets of wisdom that have been developed by generations of computer system designers. Later chapters apply these general principles and also introduce additional design principles that are more specific to particular engineering areas. Even with these principles in mind, it is often difficult to offer a precise recipe for design. Therefore throughout the text the reader will find a second form of captured wisdom in the form of several design hints that encode rationales for making trade-offs.* Together, the principles and hints suggest that computer system design, though for the most part not based on mathematical theories, is also not completely ad hoc: it is actually based on sound principles derived from experience and analysis of both successful and failed systems. The reader who understands and absorbs these principles and hints will have learned much of what this book has to say.
The third theme, making systems robust and resilient, has also already emerged, both in the statement of the robustness principle and with the idea that modularity, by limiting interconnections, can help control propagation of effects. The terms robustness and resilience are informal and overlapping descriptions of a general goal of design: that a system should not be sensitive to modest, long-term shifts in its environment (usually called robustness) and that it should continue operating correctly in the face of transient adversity (usually called resilience). Each succeeding chapter introduces at least one progressively stronger way to make a system more robust and resilient. Thus, the chapter on naming shows how indirection of names can make systems less fragile. Then, the chapters on clients and services and on virtualization demonstrate how to enforce modularity to limit the effects of mistakes and accidents. The chapter on networks introduces techniques that provide reliable communications despite communication failures. The chapter on fault tolerance then generalizes those techniques to make entire systems resilient, even though they contain faulty components. The chapters on atomicity and consistency apply fault tolerance techniques to the particular problem of maintaining the integrity of stored data, despite concurrent activity and in the face of software and hardware failures. Finally, the chapter on protecting information introduces techniques to limit the impact of malicious adversaries who would deliberately steal, modify, or deny access to information.
1.1 True or false? Explain: modularity reduces complexity because
A. It reduces the effect of incommensurate scaling.
B. It helps control propagation of effects.
1994-1-3d and 1995-1-1e
1.2 True or false? Explain: hierarchy reduces complexity because
A. It reduces the size of individual modules.
B. It cuts down on the number of interconnections between elements.
C. It assembles a number of smaller elements into a single larger element.
D. It enforces a structure on the interconnections between elements.
1994-1-3c and 1999-1-02
1.3 If one created a graph of personal friendships, one would have a hierarchy. True or false?
1995-1-1b
1.4 Which of the following is usually observed in a complex computer system?
A. The underlying technology has a high rate of change.
B. It is easy to write a succinct description of the behavior of the system.
C. It has a large number of interacting features.
D. It exhibits emergent properties that make the system perform better than envisioned by the system’s designers.
2005-1-1
1.5 Ben Bitdiddle has written a program with 16 major modules of code. Each module contains several procedures. In the first implementation of his program, he finds that each module contains at least one call to every other module. Each module contains 100 lines of code.
1.5a How long is Ben’s program in lines of code?
1.5b How many module interconnections are there in his implementation? (Each call from one module to another is an interconnection.)
Ben decides to change the implementation. Now there are four main modules, each containing four submodules in a one-level hierarchy. The four main modules each have calls to all the other main modules, and within each main module, the four submodules each have calls to one another. There are still 100 lines of code per submodule, but each main module needs 100 lines of management code.
1.5c How long is Ben’s program now?
1.5d How many interconnections are there now? Include module-to-module and submodule-to-submodule interconnections.
1.5e Was using hierarchy a good decision? Why or why not?
1996-1-2a…e
Additional exercises relating to Chapter 1 can be found in the problem sets beginning on page 425.
* Computer industry consultant (and erstwhile instructor of the course for which this textbook was written) Michael Hammer suggested the informal version of this design principle.
* This phenomenon was documented by Laszlo A. Belady and Meir M. Lehman in “A model of large program development”, IBM Systems Journal 15, 3 (1976), pages 225–252.
† Michael D. Schroeder suggested this example of a railroad line in a canyon.
* The terminology “leaky” is apparently due to software developer Joel Spolsky.
* Lawrence Lessig provides a good analysis of the interactions of law, society, and computer technology in Code: and Other Laws of Cyberspace [Suggestions for Further Reading 1.1.4].
* The idea of learning from failure and the observation that complex systems fail for complex reasons are the themes of a fascinating book by Henry Petroski, Design Paradigms: Case Histories of Error and Judgment in Engineering [Suggestions for Further Reading 1.2.3].
* An excellent book on the subject of system development, by a veteran designer, is Frederick P. Brooks Jr., The Mythical Man-Month [Suggestions for Further Reading 1.1.3]. Another highly recommended reading is the Alan Turing Award lecture by Fernando J. Corbató, “On building systems that will fail” [Suggestions for Further Reading 1.5.3].
* Many, if not all, of the hints were originally described by Butler Lampson in his paper “Hints for computer system design” [Suggestions for Further Reading 1.5.4].
2.1 The Three Fundamental Abstractions
2.2 Naming in Computer Systems
2.2.1 The Naming Model
2.2.2 Default and Explicit Context References
2.2.3 Path Names, Naming Networks, and Recursive Name Resolution
2.2.4 Multiple Lookup: Searching Through Layered Contexts
2.2.5 Comparing Names
2.2.6 Name Discovery
2.3 Organizing Computer Systems with Names and Layers
2.4 Looking Back and Ahead
2.5 Case study: UNIX® file system layering and naming
2.5.1 Application Programming Interface for the UNIX File System
2.5.2 The Block Layer
2.5.3 The File Layer
2.5.4 The Inode Number Layer
2.5.5 The File Name Layer
2.5.6 The Path Name Layer
2.5.7 Links
2.5.8 Renaming
2.5.9 The Absolute Path Name Layer
2.5.10 The Symbolic Link Layer
2.5.11 Implementing the File System API
2.5.12 The Shell and Implied Contexts, Search Paths, and Name Discovery
2.5.13 Suggestions for Further Reading
Although the number of potential abstractions for computer system components is unlimited, remarkably the vast majority that actually appear in practice fall into one of three well-defined classes: the memory, the interpreter, and the communication link. These three abstractions are so fundamental that theoreticians compare computer algorithms in terms of the number of data items they must remember, the number of steps their interpreter must execute, and the number of messages they must communicate.
Designers use these three abstractions to organize physical hardware structures, not because they are the only ways to interconnect gates, but rather because
they supply fundamental functions of recall, processing, and communication,
so far, these are the only hardware abstractions that have proven both to be widely useful and to have understandably simple interface semantics.
To meet the many requirements of different applications, system designers build layers on this fundamental base, but in doing so they do not routinely create completely different abstractions. Instead, they elaborate the same three abstractions, rearranging and repackaging them to create features that are useful and interfaces that are convenient for each application. Thus, for example, the designer of a general-purpose system such as a personal computer or a network server develops interfaces that exhibit highly refined forms of the same three abstractions. The user, in turn, may see the memory in the form of an organized file or database system, the interpreter in the form of a word processor, a game-playing system, or a high-level programming language, and the communication link in the form of instant messaging or the World Wide Web. On examination, underneath each of these abstractions is a series of layers built on the basic hardware versions of those same abstractions.
A primary method by which the abstract components of a computer system interact is reference. What that means is that the usual way for one component to connect to another is by name. Names appear in the interfaces of all three of the fundamental abstractions as well as the interfaces of their more elaborate higher-layer counterparts. The memory stores and retrieves objects by name, the interpreter manipulates named objects, and names identify communication links. Names are thus the glue that interconnects the abstractions. Named interconnections can, with proper design, be easy to change. Names also allow the sharing of objects, and they permit finding previously created objects at a later time.
This chapter briefly reviews the architecture and organization of computer systems in the light of abstraction, naming, and layering. Some parts of this review will be familiar to the reader with a background in computer software or hardware, but the systems perspective may provide some new insights into those familiar concepts and it lays the foundation for coming chapters. Section 2.1 describes the three fundamental abstractions, Section 2.2 presents a model for naming and explains how names are used in computer systems, and Section 2.3 discusses how a designer combines the abstractions, using names and layers, to create a typical computer system, presenting the file system as a concrete example of the use of naming and layering for the memory abstraction. Section 2.4 looks at how the rest of this book will consist of designing some higher-level version of one or more of the three fundamental abstractions, using names for interconnection and built up in layers. Section 2.5 is a case study showing how abstractions, naming, and layering are applied in a real file system.
We begin by examining, for each of the three fundamental abstractions, what the abstraction does, how it does it, its interfaces, and the ways it uses names for interconnection.
Memory, sometimes called storage, is the system component that remembers data values for use in computation. Although memory technology is wide-ranging, as suggested by the list of examples in Figure 2.1, all memory devices fit a simple abstract model that has two operations, named WRITE and READ:
Figure 2.1 Some examples of memory devices that may be familiar.
WRITE (name, value)
value ← READ (name)
The WRITE operation specifies in value a value to be remembered and in name a name by which one can recall that value in the future. The READ operation specifies in name the name of some previously remembered value, and the memory device returns that value. A later call to WRITE that specifies the same name updates the value associated with that name.
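To make this abstract model concrete, here is a minimal sketch in Python; the class and method names are illustrative, not an interface defined by this text.

    # Minimal sketch of the abstract memory model: WRITE(name, value)
    # remembers a value under a name, and READ(name) returns the value
    # most recently written under that name.
    class Memory:
        def __init__(self):
            self._cells = {}              # name -> most recently written value

        def write(self, name, value):
            self._cells[name] = value     # a later WRITE to the same name updates it

        def read(self, name):
            return self._cells[name]

    m = Memory()
    m.write("x", 17)
    m.write("x", 42)                      # updates the value associated with "x"
    assert m.read("x") == 42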
Memories can be either volatile or non-volatile. A volatile memory is one whose mechanism of retaining information consumes energy; if its power supply is interrupted for some reason, it forgets its information content. When one turns off the power to a non-volatile memory (sometimes called “stable storage”), it retains its content, and when power is again available, READ operations return the same values as before. By connecting a volatile memory to a battery or an uninterruptible power supply, it can be made durable, which means that it is designed to remember things for at least some specified period, known as its durability. Even non-volatile memory devices are subject to eventual deterioration, known as decay, so they usually also have a specified durability, perhaps measured in years. We will revisit durability in Chapters 8 [on-line] and 10 [on-line], where we will see methods of obtaining different levels of durability. Sidebar 2.1 compares the meaning of durability with two other, related words.
Sidebar 2.1 Terminology: Durability, Stability, and Persistence
Both in common English usage and in the professional literature, the terms durability, stability, and persistence overlap in various ways and are sometimes used almost interchangeably. In this text, we define and use them in a way that emphasizes certain distinctions.
Durability A property of a storage medium: the length of time it remembers.
Stability A property of an object: it is unchanging.
Persistence A property of an active agent: it keeps trying.
Thus, the current chapter suggests that files be placed in a durable storage medium—that is, they should survive system shutdown and remain intact for as long as they are needed. Chapter 8 [on-line] revisits durability specifications and classifies applications according to their durability requirements.
This chapter introduces the concept of stable bindings for names, which, once determined, never again change.
Chapter 7 [on-line] introduces the concept of a persistent sender, a participant in a message exchange who keeps retransmitting a message until it gets confirmation that the message was successfully received, and Chapter 8 [on-line] describes persistent faults, which keep causing a system to fail.
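As an informal illustration of persistence in this last sense, the toy sketch below keeps retransmitting until it receives confirmation. The channel and its 50 percent loss rate are invented for illustration; the real protocols appear in Chapter 7 [on-line].

    import random

    class UnreliableChannel:
        """Toy channel: each transmission is acknowledged with probability 0.5."""
        def send_and_wait_for_ack(self, message):
            return random.random() < 0.5          # True: an acknowledgment arrived

    def persistent_send(channel, message, max_tries=50):
        for attempt in range(max_tries):          # keep retransmitting...
            if channel.send_and_wait_for_ack(message):
                return attempt + 1                # ...until delivery is confirmed
        raise RuntimeError("gave up")             # persistent, but not infinitely stubborn

    tries = persistent_send(UnreliableChannel(), "hello")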
At the physical level, a memory system does not normally name, READ, or WRITE values of arbitrary size. Instead, hardware layer memory devices READ and WRITE contiguous arrays of bits, usually fixed in length, known by various terms such as bytes (usually 8 bits, but one sometimes encounters architectures with 6-, 7-, or 9-bit bytes), words (a small integer number of bytes, typically 2, 4, or 8), lines (several words), and blocks (a number of bytes, usually a power of 2, that can measure in the thousands). Whatever the size of the array, the unit of physical layer memory written or read is known as a memory (or storage) cell. In most cases, the name argument in the read and write calls is actually the name of a cell. Higher-layer memory systems also READ and WRITE contiguous arrays of bits, but these arrays usually can be of any convenient length, and are called by terms such as record, segment, or file.
Two useful properties for a memory are read/write coherence and before-or-after atomicity. Read/write coherence means that the result of the READ of a named cell is always the same as the most recent WRITE to that cell. Before-or-after atomicity means that the result of every READ or WRITE is as if that READ or WRITE occurred either completely before or completely after any other READ or WRITE. Although it might seem that a designer should be able simply to assume these two properties, that assumption is risky and often wrong. There are a surprising number of threats to read/write coherence and before-or-after atomicity:
Concurrency. In systems where different actors can perform READ and WRITE operations concurrently, they may initiate two such operations on the same named cell at about the same time. There needs to be some kind of arbitration that decides which one goes first and to ensure that one operation completes before the other begins.
Remote storage. When the memory device is physically distant, the same concerns arise, but they are amplified by delays, which make the question of “which WRITE was most recent?” problematic and by additional forms of failure introduced by communication links. Section 4.5 introduces remote storage, and Chapter 10 [on-line] explores solutions to before-or-after atomicity and read/write coherence problems that arise with remote storage systems.
Performance enhancements. Optimizing compilers and high-performance processors may rearrange the order of memory operations, possibly changing the very meaning of “the most recent WRITE to that cell” and thereby destroying read/write coherence for concurrent READ and WRITE operations. For example, a compiler might delay the WRITE operation implied by an assignment statement until the register holding the value to be written is needed for some other purpose. If someone else performs a READ of that variable, they may receive an old value. Some programming languages and high-performance processor architectures provide special programming directives to allow a programmer to restore read/write coherence on a case-by-case basis. For example, the Java language has a SYNCHRONIZED declaration that protects a block of code from read/write incoherence, and Hewlett-Packard’s Alpha processor architecture (among others) includes a memory barrier (MB) instruction that forces all preceding READs and WRITEs to complete before going on to the next instruction. Unfortunately, both of these constructs create opportunities for programmers to make subtle mistakes.
Cell size incommensurate with value size. A large value may occupy multiple memory cells, in which case before-or-after atomicity requires special attention. The problem is that both reading and writing of a multiple-cell value is usually done one cell at a time. A reader running concurrently with a writer that is updating the same multiple-cell value may end up with a mixed bag of cells, only some of which have been updated. Computer architects call this hazard write tearing. Failures that occur in the middle of writing multiple-cell values can further complicate the situation. To restore before-or-after atomicity, concurrent readers and writers must somehow be coordinated, and a failure in the middle of an update must leave either all or none of the intended update intact. When these conditions are met, the READ or WRITE is said to be atomic. A closely related risk arises when a small value shares a memory cell with other small values. The risk is that if two writers concurrently update different values that share the same cell, one may overwrite the other’s update. Atomicity can also solve this problem. Chapter 5 begins the study of atomicity by exploring methods of coordinating concurrent activities. Chapter 9 [on-line] expands the study of atomicity to also encompass failures.
Replicated storage. As Chapter 8 [on-line] will explore in detail, reliability of storage can be increased by making multiple copies of values and placing those copies in distinct storage cells. Storage may also be replicated for increased performance, so that several readers can operate concurrently. But replication increases the number of ways in which concurrent READ and WRITE operations can interact and possibly lose either read/write coherence or before-or-after atomicity. During the time it takes a writer to update several replicas, readers of an updated replica can get different answers from readers of a replica that the writer hasn’t gotten to yet. Chapter 10 [on-line] discusses techniques to ensure read/write coherence and before-or-after atomicity for replicated storage.
Often, the designer of a system must cope with not just one but several of these threats simultaneously. The combination of replication and remoteness is particularly challenging. It can be surprisingly difficult to design memories that are both efficient and also read/write coherent and atomic. To simplify the design or achieve higher performance, designers sometimes build memory systems that have weaker coherence specifications. For example, a multiple processor system might specify: “The result of a READ will be the value of the latest WRITE if that WRITE was performed by the same processor.” There is an entire literature of “data consistency models” that explores the detailed properties of different memory coherence specifications. In a layered memory system, it is essential that the designer of a layer know precisely the coherence and atomicity specifications of any lower layer memory that it uses. In turn, if the layer being designed provides memory for higher layers, the designer must specify precisely these two properties that higher layers can expect and depend on. Unless otherwise mentioned, we will assume that physical memory devices provide read/write coherence for individual cells, but that before-or-after atomicity for multicell values (for example, files) is separately provided by the layer that implements them.
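To see the write-tearing hazard and one way of restoring before-or-after atomicity, consider the following sketch, in which a value occupies two cells and a lock coordinates concurrent readers and writers. The structure and names are illustrative, not a prescription; coping with failures in mid-update requires the additional machinery of Chapter 9 [on-line].

    import threading

    cells = ["old", "old"]               # one value stored in two cells
    lock = threading.Lock()

    def write_value(a, b):
        with lock:                       # both cells change before any READ can run,
            cells[0] = a                 # so no reader can observe a torn write
            cells[1] = b

    def read_value():
        with lock:                       # occurs entirely before or after any WRITE
            return (cells[0], cells[1])

    write_value("new", "new")
    assert read_value() in [("old", "old"), ("new", "new")]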
An important property of a memory is the time it takes for a READ or a WRITE to complete, which is known as its latency (often called access time, though that term has a more precise definition that will be explained in Sidebar 6.4). In the magnetic disk memory (described in Sidebar 2.2) the latency of a particular sector depends on the mechanical state of the device at the instant the user requests access. Having read a sector, one may measure the time required to also read a different but nearby sector in microseconds—but only if the user anticipates the second read and requests it before the disk rotates past that second sector. A request just a few microseconds late may encounter a delay that is a thousand times longer, waiting for that second sector to again rotate under the read head. Thus the maximum rate at which one can transfer data to or from a disk is dramatically larger than the rate one would achieve when choosing sectors at random. A random access memory (RAM) is one for which the latency for memory cells chosen at random is approximately the same as the latency for cells chosen in the pattern best suited for that memory device. An electronic memory chip is usually configured for random access. Memory devices that involve mechanical movement, such as optical disks (CDs and DVDs) and magnetic tapes and disks, are not.
Sidebar 2.2 How Magnetic Disks Work
Magnetic disks consist of rotating circular platters coated on both sides with a magnetic material such as ferric oxide. An electromagnet called a disk head records information by aligning the magnetic field of the particles in a small region on the platter’s surface. The same disk head reads the data by sensing the polarity of the aligned particles as the platter spins by. The disk spins continuously at a constant rate, and the disk head actually floats just a few nanometers above the disk surface on an air cushion created by the rotation of the platter.
From a single position above a platter, a disk head can read or write a set of bits, called a track, located a constant distance from the center. In the top view below, the shaded region identifies a track. Tracks are formatted into equal-sized blocks, called sectors, by writing separation marks periodically around the track. Because all sectors are the same size, the outer tracks have more sectors than the inner ones.
A typical modern disk module, known as a “hard drive” because its platters are made of a rigid material, contains several platters spinning on a common axis called a spindle, as in the side view above. One disk head per platter surface is mounted on a comb-like structure that moves the heads in unison across the platters. Movement to a specific track is called seeking, and the comb-like structure is known as a seek arm. The set of tracks that can be read or written when the seek arm is in one position (for example, the shaded regions of the side view) is called a cylinder. Tracks, platters, and sectors are each numbered. A sector is thus addressed by geometric coordinates: track number, platter number, and rotational position. Modern disk controllers typically do the geometric mapping internally and present their clients with an address space consisting of consecutively numbered sectors.
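That geometric mapping can be sketched in one line of arithmetic, assuming (unrealistically, as noted above) that every track holds the same number of sectors:

    def sector_number(track, platter, rotational_position,
                      platters=4, sectors_per_track=63):
        # Map geometric coordinates to a consecutively numbered sector.
        # Real controllers must also account for outer tracks holding
        # more sectors than inner ones.
        return ((track * platters) + platter) * sectors_per_track + rotational_position

    assert sector_number(0, 0, 0) == 0   # first sector of the first track
    assert sector_number(0, 1, 0) == 63  # same track, next platter down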
To read or write a particular sector, the disk controller first seeks the desired track. Once the seek arm is in position, the controller waits for the beginning of the desired sector to rotate under the disk head, and then it activates the head on the desired platter. Physically encoding digital data in analog magnetic domains usually requires that the controller write complete sectors.
The time required for disk access is called latency, a term defined more precisely in Chapter 6. Moving a seek arm takes time. Vendors quote seek times of 5 to 10 milliseconds, but that is an average over all possible seek arm moves. A move from one cylinder to the next may require only 1/20 of the time of a move from the innermost to the outermost track. It also takes time for a particular sector to rotate under the disk head. A typical disk rotation rate is 7200 rpm, for which the platter rotates once in 8.3 milliseconds. The time to transfer the data depends on the magnetic recording density, the rotation rate, the cylinder number (outer cylinders may transfer at higher rates), and the number of bits read or written. A platter that holds 40 gigabytes transfers data at rates between 300 and 600 megabits per second; at those rates a 1-kilobyte sector transfers in roughly 14 to 27 microseconds. Seek time and rotation delay are limited by mechanical engineering considerations and tend to improve only slowly, but magnetic recording density depends on materials technology, which has improved both steadily and rapidly for many years.
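For readers who want to check those figures, the arithmetic is straightforward:

    rpm = 7200
    rotation_ms = 60_000 / rpm                 # 8.33 milliseconds per revolution
    average_rotational_delay_ms = rotation_ms / 2   # on average, half a revolution

    sector_bits = 1024 * 8                     # a 1-kilobyte sector
    slow, fast = 300e6, 600e6                  # transfer rates, bits per second
    transfer_us = (sector_bits / fast * 1e6,   # about 13.7 microseconds
                   sector_bits / slow * 1e6)   # about 27.3 microseconds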
Early disk systems stored between 20 and 80 megabytes. In the 1970s Kenneth Haughton, an IBM inventor, described a new technique of placing disk platters in a sealed enclosure to avoid contamination. The initial implementation stored 30 megabytes on each of two spindles, in a configuration known as a 30–30 drive. Haughton nicknamed it the “Winchester”, after the Winchester 30–30 rifle. The code name stuck, and for many years hard drives were known as Winchester drives. Over the years, Winchester drives have gotten physically smaller while simultaneously evolving to larger capacities.
For devices that do not provide random access, it is usually a good idea, having paid the cost in delay of moving the mechanical components into position, to READ or WRITE a large block of data. Large-block READ and WRITE operations are sometimes relabeled GET and PUT, respectively, and this book uses that convention. Traditionally, the unqualified term memory meant random-access volatile memory and the term storage was used for non-volatile memory that is read and written in large blocks with GET and PUT. In practice, there are enough exceptions to this naming rule that the words “memory” and “storage” have become almost interchangeable.
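A minimal sketch of the GET/PUT style of interface, in which the unit of transfer is an entire block (the class and the block size are illustrative):

    BLOCK_SIZE = 4096                    # bytes per block (illustrative)

    class BlockStorage:
        def __init__(self, nblocks):
            self._blocks = [bytes(BLOCK_SIZE)] * nblocks

        def put(self, block_number, data):
            assert len(data) == BLOCK_SIZE    # whole blocks only
            self._blocks[block_number] = data

        def get(self, block_number):
            return self._blocks[block_number]

    disk = BlockStorage(100)
    disk.put(7, bytes(BLOCK_SIZE))
    assert len(disk.get(7)) == BLOCK_SIZE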
Physical implementations of memory devices nearly always name a memory cell by the geometric coordinates of its physical storage location. Thus, for example, an electronic memory chip is organized as a two-dimensional array of flip-flops, each holding one named bit. The access mechanism splits the bit name into two parts, which in turn go to a pair of multiplexers. One multiplexer selects an x-coordinate, the other a y-coordinate, and the two coordinates in turn select the particular flip-flop that holds that bit. Similarly, in a magnetic disk memory, one component of the name electrically selects one of the recording platters, while a distinct component of the name selects the position of the seek arm, thereby choosing a specific track on that platter. A third name component selects a particular sector on that track, which may be identified by counting sectors as they pass under the read head, starting from an index mark that identifies the first sector.
It is easy to design hardware that maps geometric coordinates to and from sets of names consisting of consecutive integers (0, 1, 2, etc.). These consecutive integer names are called addresses, and they form the address space of the memory device. A memory system that uses names that are sets of consecutive integers is called a location-addressed memory. Because the addresses are consecutive, the size of the memory cell that is named does not have to be the same as the size of the cell that is read or written. In some memory architectures each byte has a distinct address, but reads and writes can (and in some cases must always) occur in larger units, such as a word or a line.
For most applications, consecutive integers are not exactly the names that one would choose for recalling data. One would usually prefer to be allowed to choose less constrained names. A memory system that accepts unconstrained names is called an associative memory. Since physical memories are generally location-addressed, a designer creates an associative memory by interposing an associativity layer, which may be implemented either with hardware or software, that maps unconstrained higher-level names to the constrained integer names of an underlying location-addressed memory, as in Figure 2.2. Examples of software associative memories, constructed on top of one or more underlying location-addressed memories, include personal telephone directories, file systems, and corporate database systems. A cache, a device that remembers the result of an expensive computation in the hope of not redoing that computation if it is needed again soon, is sometimes implemented as an associative memory, either in software or hardware. (The design of caches is discussed in Section 6.2.)
Figure 2.2 An associative memory implemented in two layers. The associativity layer maps the unconstrained names of its arguments to the consecutive integer addresses required by the physical layer location-addressed memory.
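The two-layer structure of Figure 2.2 can be sketched in a few lines; here the software associativity layer binds each unconstrained name to the next free address of an underlying location-addressed memory (all names are illustrative):

    class LocationAddressedMemory:
        def __init__(self, size):
            self._cells = [None] * size      # addresses are 0, 1, 2, ...

        def write(self, address, value):
            self._cells[address] = value

        def read(self, address):
            return self._cells[address]

    class AssociativityLayer:
        def __init__(self, physical):
            self._physical = physical
            self._directory = {}             # unconstrained name -> address
            self._next_free = 0

        def write(self, name, value):
            if name not in self._directory:  # bind the name to a free cell
                self._directory[name] = self._next_free
                self._next_free += 1
            self._physical.write(self._directory[name], value)

        def read(self, name):
            return self._physical.read(self._directory[name])

    memory = AssociativityLayer(LocationAddressedMemory(1024))
    memory.write("alice", "555-0100")        # a tiny personal telephone directory
    assert memory.read("alice") == "555-0100"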
Layers that provide associativity and name mapping figure strongly in the design of all memory and storage systems. For example, Table 2.2 on page 93 lists the layers of the UNIX file system. For another example of layering of memory abstractions, Chapter 5 explains how memory can be virtualized by adding a name-mapping layer.
Returning to the subject of abstraction, a system known as RAID provides an illustration of the power of modularity and of how the storage abstraction can be applied to good effect. RAID is an acronym for Redundant Array of Independent (or Inexpensive) Disks. A RAID system consists of a set of disk drives and a controller configured with an electrical and programming interface that is identical to the interface of a single disk drive, as shown in Figure 2.3. The RAID controller intercepts READ and WRITE requests coming across its interface, and it directs them to one or more of the disks. RAID has two distinct goals:
Figure 2.3 Abstraction in RAID. The READ/WRITE electrical and programming interface of the RAID system, represented by the solid arrow, is identical to that of a single disk.
Improved performance, by reading or writing disks concurrently
Improved durability, by writing information on more than one disk
Different RAID configurations offer different trade-offs between these goals. Whatever trade-off the designer chooses, because the interface abstraction is that of a single disk, the programmer can take advantage of the improvements in performance and durability without reprogramming.
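The force of that observation is easy to see in a sketch: a mirroring controller in the spirit of RAID 1 that presents exactly the same READ/WRITE interface as a single disk (the classes are illustrative):

    class Disk:
        def __init__(self):
            self._sectors = {}

        def write(self, sector, data):
            self._sectors[sector] = data

        def read(self, sector):
            return self._sectors[sector]

    class MirroringController:
        # Same interface as Disk, so a client cannot tell the difference.
        def __init__(self, disk0, disk1):
            self._disks = (disk0, disk1)

        def write(self, sector, data):
            for d in self._disks:            # durability: every sector on both disks
                d.write(sector, data)

        def read(self, sector):
            return self._disks[0].read(sector)   # either copy would do

    raid = MirroringController(Disk(), Disk())
    raid.write(7, b"payload")
    assert raid.read(7) == b"payload"

Because MirroringController and Disk expose the same interface, one can be substituted for the other without reprogramming the client, which is exactly the point of the abstraction.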
Certain useful RAID configurations are traditionally identified by (somewhat arbitrary) numbers. In later chapters, we will encounter several of these numbered configurations. The configuration known as RAID 0 (in Section 6.1.5) provides increased performance by allowing concurrent reading and writing. The configuration known as RAID 4 (shown in Figure 8.6 [on-line]) improves disk reliability by applying error-correction codes. Yet another configuration known as RAID 1 (in Section 8.5.4.6 [on-line]) provides high durability by making identical copies of the data on different disks. Exercise 8.8 [on-line] explores a simple but elegant performance optimization known as RAID 5. These and several other RAID configurations were originally described in depth in a paper by Randy Katz, Garth Gibson, and David Patterson, who also assigned the traditional numbers to the different configurations [see Suggestions for Further Reading 10.2.2].
Interpreters are the active elements of a computer system; they perform the actions that constitute computations. Figure 2.4 lists some examples of interpreters that may be familiar. As with memory, interpreters also come in a wide range of physical manifestations. However, they too can be described with a simple abstraction, consisting of just three components:
Figure 2.4 Some common examples of interpreters. The disk controller example is explained in Section 2.3 and the Web browser examples are the subject of Exercise 4.5.
1. An instruction reference, which tells the interpreter where to find its next instruction
2. A repertoire, which defines the set of actions the interpreter is prepared to perform when it retrieves an instruction from the location named by the instruction reference
3. An environment reference, which tells the interpreter where to find its environment, the current state on which the interpreter should perform the action of the current instruction
The normal operation of an interpreter is to proceed sequentially through some program, as suggested by the diagram and pseudocode of Figure 2.5. Using the environment reference to find the current environment, the interpreter retrieves from that environment the program instruction indicated in the instruction reference. Again using the environment reference, the interpreter performs the action directed by the program instruction. That action typically involves using and perhaps changing data in the environment, and also an appropriate update of the instruction reference. When it finishes performing the instruction, the interpreter moves on, taking as its next instruction the one now named by the instruction reference. Certain events, called interrupts, may catch the attention of the interpreter, causing it, rather than the program, to supply the next instruction. The original program no longer controls the interpreter; instead, a different program, the interrupt handler, takes control and handles the event. The interpreter may also change the environment reference to one that is appropriate for the interrupt handler.
Figure 2.5 Structure of, and pseudocode for, an abstract interpreter. Solid arrows show control flow, and dashed arrows suggest information flow. Sidebar 2.3 describes this book’s conventions for expressing pseudocode.
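To make the abstraction concrete, here is a minimal sketch in Python of the interpreter loop of Figure 2.5. The environment layout, the two-instruction repertoire, and the way interrupts are modeled are illustrative assumptions, not the book's pseudocode.

    # A toy interpreter built from the three components of the abstraction:
    # an instruction reference (pc), a repertoire (add, jump, halt), and an
    # environment reference (the dictionary the interpreter acts upon).
    def interpret(environment, pending_interrupts):
        while True:
            if pending_interrupts:
                # An interrupt supplies the next instruction: switch the
                # environment reference to one set up for the handler.
                environment = pending_interrupts.pop(0)
            pc = environment["pc"]                 # instruction reference
            instruction = environment["program"][pc]
            environment["pc"] = pc + 1             # usual case: proceed sequentially
            op, args = instruction[0], instruction[1:]
            if op == "add":                        # repertoire: dst <- a + b
                dst, a, b = args
                environment["data"][dst] = environment["data"][a] + environment["data"][b]
            elif op == "jump":                     # repertoire: change the instruction reference
                environment["pc"] = args[0]
            elif op == "halt":
                return environment["data"]

    env = {"pc": 0,
           "program": [("add", "x", "x", "y"), ("halt",)],
           "data": {"x": 1, "y": 2}}
    print(interpret(env, []))                      # {'x': 3, 'y': 2}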
Sidebar 2.3: Representation: Pseudocode and Messages
This book presents many examples of program fragments. Most of them are represented in pseudocode, an imaginary programming language that adopts familiar features from different existing programming languages as needed and that occasionally intersperses English text to characterize some step whose exact detail is unimportant. The pseudocode has some standard features, several of which this brief example shows.
1 procedure sum (a, b) // Add two numbers.
2 total ← a + b
3 return total
The line numbers on the left are not part of the pseudocode; they are there simply to allow the text to refer to lines in the program. Procedures are explicitly declared (as in line 1), and indentation groups blocks of statements together. Program variables are set in italic, program key words in bold, and literals such as the names of procedures and built-in constants in SMALL CAPS. The left arrow denotes substitution or assignment (line 2) and the symbol “=” denotes equality in conditional expressions. The double slash precedes comments that are not part of the pseudocode. Various forms of iteration (while, until, for each, do occasionally), conditionals (if), set operations (is in), and case statements (do case) appear when they are helpful in expressing an example. The construction for j from 0 to 3 iterates four times; array indices start at 0 unless otherwise mentioned. The construction y.x means the element named x in the structure named y. To minimize clutter, the pseudocode omits declarations wherever the meaning is reasonably apparent from the context. Procedure parameters are passed by value unless the declaration reference appears. Section 2.2.1 of this chapter discusses the distinction between use by value and use by reference. When more than one variable uses the same structure, the declaration structure_name instance variable_name may be used.
The notation a(11…15) denotes extraction of bits 11 through 15 from the string a (or from the variable a considered as a string). Bits are numbered left to right starting with zero, with the most significant bit of integers first (using big-endian notation, as described in Sidebar 4.3). The + operator, when applied to strings, concatenates the strings.
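For example, one way to render this bit-numbering convention in Python (the helper below is an illustration, not the book's notation):

    def extract(a, first, last, width=32):
        # a(first...last): bits numbered left to right starting at zero,
        # with the most significant bit first, in a word of `width` bits.
        nbits = last - first + 1
        shift = width - 1 - last        # distance of bit `last` from the right end
        return (a >> shift) & ((1 << nbits) - 1)

    a = 0b0000000000010101_0000000000000000   # bits 11...15 hold the value 10101
    print(bin(extract(a, 11, 15)))            # 0b10101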
Some examples are represented in the instruction repertoire of an imaginary reduced instruction set computer (RISC). Because such programs are cumbersome, they appear only when it is essential to show how software interacts with hardware.
In describing and using communication links, the notation
represents a message with contents M from sender x to recipient y. The notation {a, b, c} represents a message that contains the three named fields marshaled in some way that the recipient presumably understands how to unmarshal.
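As a concrete illustration, the sketch below uses JSON as one possible marshaling convention; the encoding is an assumption, since the book deliberately leaves it unspecified.

    import json

    def marshal(**fields):
        # {a, b, c}: named fields encoded in a form the recipient can decode
        return json.dumps(fields).encode()

    def unmarshal(message):
        return json.loads(message.decode())

    message = marshal(a=1, b="two", c=[3, 4])
    print(unmarshal(message))            # {'a': 1, 'b': 'two', 'c': [3, 4]}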
Many systems have more than one interpreter. Multiple interpreters are usually asynchronous, which means that they run on separate, uncoordinated clocks. As a result, they may progress at different rates, even if they are nominally identical and running the same program. In designing algorithms that coordinate the work of multiple interpreters, one usually assumes that there is no fixed relation among their progress rates and therefore no way to predict the relative timing of, for example, the LOAD and STORE instructions that they issue. The assumption of interpreter asynchrony is one of the reasons that memory read/write coherence and before-or-after atomicity can be challenging design problems.
A general-purpose processor is an implementation of an interpreter. For purposes of concrete discussion throughout this book, we use a typical reduced instruction set processor. The processor’s instruction reference is a program counter, stored in a fast memory register inside the processor. The program counter contains the address of the memory location that stores the next instruction of the current program. The environment reference of the processor consists in part of a small amount of built-in location-addressed memory in the form of named (by number) registers for fast access to temporary results of computations.
Our general-purpose processor may be directly wired to a memory, which is also part of its environment. The addresses in the program counter and in instructions are then names in the address space of that memory, so this part of the environment reference is wired in and unchangeable. When we discuss virtualization in Chapter 5, we will extend the processor to refer to memory indirectly via one or more registers. With that change, the environment reference is maintained in those registers, thus allowing addresses issued by the processor to map to different names in the address space of the memory.
The repertoire of our general-purpose processor includes instructions for expressing computations such as adding two numbers (ADD), subtracting one number from another (SUB), comparing two numbers (CMP), and changing the program counter to the address of another instruction (JMP). These instructions operate on values stored in the named registers of the processor. (The short names for instructions, such as ADD and SUB, are mnemonics for operation codes, colloquially "op-codes".)
The repertoire also includes instructions to move data between processor registers and memory. To distinguish program instructions from memory operations, we use the name LOAD for the instruction that READs a value from a named memory cell into a register of the processor and STORE for the instruction that WRITEs the value from a register into a named memory cell. These instructions take two integer arguments, the name of a memory cell and the name of a processor register.
The general-purpose processor provides a stack, a push-down data structure that is stored in memory and used to implement procedure calls. When calling a procedure, the caller pushes arguments of the called procedure (the callee) on the stack. When the callee returns, the caller pops the stack back to its previous size. This implementation of procedures supports recursive calls because every invocation of a procedure always finds its arguments at the top of the stack. We dedicate one register for implementing stack operations efficiently. This register, known as the stack pointer, holds the memory address of the top of the stack.
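A minimal sketch of this calling convention, simulated in Python with a list standing in for the memory-resident stack (the conventions here are assumptions for illustration; a real processor manipulates the stack pointer register directly):

    stack = []                             # the stack, with its top at the end

    def call(procedure, *args):
        stack.extend(args)                 # caller pushes the callee's arguments
        result = procedure()
        del stack[len(stack) - len(args):] # caller pops back to the previous size
        return result

    def factorial():
        n = stack[-1]                      # arguments are always at the top of the stack
        return 1 if n == 0 else n * call(factorial, n - 1)

    print(call(factorial, 5))              # 120: recursion works because every
                                           # invocation finds its own argument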
As part of interpreting an instruction, the processor increments the program counter so that, when that instruction is complete, the program counter contains the address of the next instruction of the program. If the instruction being interpreted is a JMP, that instruction loads a new value into the program counter. In both cases, the flow of instruction interpretation is under control of the running program.
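The following toy model pulls these pieces together: a program counter, numbered registers, LOAD and STORE between registers and a wired-in memory, and a JMP that overwrites the program counter. The instruction encoding is invented for the sketch; it is not a real machine's repertoire.

    registers = [0] * 4              # named (by number) registers
    memory = [7, 5, 0]               # wired-in memory; cell 2 receives the result

    program = [
        ("LOAD", 0, 0),              # r0 <- memory[0]
        ("LOAD", 1, 1),              # r1 <- memory[1]
        ("ADD", 0, 1),               # r0 <- r0 + r1
        ("STORE", 0, 2),             # memory[2] <- r0
        ("JMP", 5),                  # program counter <- 5
        ("HALT",),
    ]

    pc = 0                           # the program counter
    while True:
        op, *args = program[pc]
        pc += 1                      # usual case: the next sequential instruction
        if op == "LOAD":
            registers[args[0]] = memory[args[1]]
        elif op == "STORE":
            memory[args[1]] = registers[args[0]]
        elif op == "ADD":
            registers[args[0]] += registers[args[1]]
        elif op == "JMP":
            pc = args[0]             # JMP loads a new value into the program counter
        elif op == "HALT":
            break

    print(memory[2])                 # 12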
The processor also implements interrupts. An interrupt can occur because the processor has detected some problem with the running program (e.g., the program attempted to execute an instruction that the interpreter does not or cannot implement, such as dividing by zero). An interrupt can also occur because a signal arrives from outside the processor, indicating that some external device needs attention (e.g., the keyboard signals that a key press is available). In the first case, the interrupt mechanism may transfer control to an exception handler elsewhere in the program. In the second case, the interrupt handler may do some work and then return control to the original program. We shall return to the subject of interrupts and the distinction between interrupt handlers and exception handlers in the discussion of threads in Chapter 5.
In addition to general-purpose processors, computer systems typically also have special-purpose processors, which have a limited repertoire. For example, a clock chip is a simple, hard-wired interpreter that just counts: at some specified frequency, it executes an ADD instruction, which adds 1 to the contents of a register or memory location that corresponds to the clock. All processors, whether general-purpose or specialized, are examples of interpreters. However, they may differ substantially in the repertoire they provide. One must consult the device manufacturer’s manual to learn the repertoire.
Interpreters are nearly always organized in layers. The lowest layer is usually a hardware engine that has a fairly primitive repertoire of instructions, and successive layers provide an increasingly rich or specialized repertoire. A full-blown application system may involve four or five distinct layers of interpretation. Across any given layer interface, the lower layer presents some repertoire of possible instructions to the upper layer. Figure 2.6 illustrates this model.
Figure 2.6 The model for a layered interpreter. Each layer interface, shown as a dashed line, represents an abstraction barrier, across which an upper layer procedure requests execution of instructions from the repertoire of the lower layer. The lower layer procedure typically implements an instruction by performing several instructions from the repertoire of the next lower layer interface.
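A schematic sketch of this model with two invented layers: the lower layer offers only a primitive increment, and the upper layer implements its richer add instruction by issuing several lower-layer instructions across the abstraction barrier.

    class HardwareLayer:
        # primitive repertoire: increment a value by one
        def execute(self, op, *args):
            if op == "add1":
                return args[0] + 1
            raise ValueError("not in this layer's repertoire")

    class ArithmeticLayer:
        # richer repertoire, built out of the layer below
        def __init__(self, lower):
            self.lower = lower
        def execute(self, op, *args):
            if op == "add":
                total, n = args
                for _ in range(n):    # one instruction here becomes several below
                    total = self.lower.execute("add1", total)
                return total
            raise ValueError("not in this layer's repertoire")

    upper = ArithmeticLayer(HardwareLayer())
    print(upper.execute("add", 4, 3))             # 7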
Consider, for example, a calendar management program. The person making requests by moving and clicking a mouse views the calendar program as an interpreter of the mouse gestures. The instruction reference tells the interpreter to obtain its next instruction from the keyboard and mouse. The repertoire of instructions is the set of available requests—to add a new event, to insert some descriptive text, to change the hour, or to print a list of the day’s events. The environment is a set of files that remembers the calendar from day to day.
The calendar program implements each action requested by the user by invoking statements in some programming language such as Java. These statements—such as iteration statements, conditional statements, substitution statements, procedure calls—constitute the instruction repertoire of the next lower layer. The instruction reference keeps track of which statement is to be executed next, and the environment is the collection of named variables used by the program. (We are assuming here that the Java language program has not been compiled directly to machine language. If a compiler is used, there would be one less layer.)
The actions of the programming language are in turn implemented by hardware machine language instructions of some general-purpose processor, with its own instruction reference, repertoire, and environment reference.
Figure 2.7 illustrates the three layers just described. In practice, the layered structure may be deeper—the calendar program is likely to be organized with an internal upper layer that interprets the graphical gestures and a lower layer that manipulates the calendar data, the Java interpreter may have an intermediate byte-code interpreter layer, and some machine languages are implemented with a microcode interpreter layer on top of a layer of hardware gates.
Figure 2.7 An application system that has three layers of interpretation, each with its own repertoire of instructions.
One goal in the design of a layered interpreter is to ensure that the designer of each layer can be confident that the layer below either completes each instruction successfully or does nothing at all. Half-finished instructions should never be a concern, even if there is a catastrophic failure. That goal is another example of atomicity, and achieving it is relatively difficult. For the moment, we simply assume that interpreters are atomic, and we defer the discussion of how to achieve atomicity to Chapter 9 [on-line].
A communication link provides a way for information to move between physically separated components. Communication links, of which a few examples are listed in Figure 2.8, come in a wide range of technologies, but, like memories and interpreters, they can be described with a simple abstraction. The communication link abstraction has two operations:
Figure 2.8 Some examples of communication links.
SEND (link_name, outgoing_message_buffer)
RECEIVE (link_name, incoming_message_buffer)
The SEND operation specifies an array of bits, called a message, to be sent over the communication link identified by link_name (for example, a wire). The argument outgoing_message_buffer identifies the message to be sent, usually by giving the address and size of a buffer in memory that contains the message. The RECEIVE operation accepts an incoming message, again usually by designating the address and size of a buffer in memory to hold the incoming message. Once the lowest layer of a system has received a message, higher layers may acquire the message by calling a RECEIVE interface of the lower layer, or the lower layer may “upcall” to the higher layer, in which case the interface might be better characterized as DELIVER (incoming_message).
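A minimal sketch of the abstraction, using in-memory queues to stand in for physical links (an assumption; real links add the complications discussed below):

    import queue

    links = {"wire1": queue.Queue()}     # link_name -> communication link

    def send(link_name, outgoing_message_buffer):
        links[link_name].put(bytes(outgoing_message_buffer))

    def receive(link_name, incoming_message_buffer):
        message = links[link_name].get() # blocks until a message arrives
        n = min(len(message), len(incoming_message_buffer))
        incoming_message_buffer[:n] = message[:n]
        return n                         # the size was not known in advance

    send("wire1", b"hello")
    buffer = bytearray(16)
    n = receive("wire1", buffer)
    print(buffer[:n])                    # bytearray(b'hello')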
Names connect systems to communication links in two different ways. First, the link_name arguments of SEND and RECEIVE identify one of possibly several available communication links attached to the system. Second, some communication links are actually multiply-attached networks of links, and some additional method is needed to name which of several possible recipients should receive the message. The name of the intended recipient is typically one of the components of the message.
At first glance, it might appear that sending and receiving a message is just an example of copying an array of bits from one memory to another over a wire, using a sequence of READ and WRITE operations, so there is no need for a third abstraction. However, communication links involve more than simple copying. They have many complications, such as a wide range of operating parameters that makes the time to complete a SEND or RECEIVE operation unpredictable, a hostile environment that threatens the integrity of the data transfer, asynchronous operation that leads to the arrival of messages whose size and time of delivery cannot be known in advance, and, most significant of all, the possibility that the message may not even be delivered. Because of these complications, the semantics of SEND and RECEIVE are typically quite different from those associated with READ and WRITE. Programs that invoke SEND and RECEIVE must take these different semantics explicitly into account. On the other hand, some communication link implementations do provide a layer that does its best to hide a SEND/RECEIVE interface behind a READ/WRITE interface.
Just as with memory and interpreters, designers organize and implement communication links in layers. Rather than continuing a detailed discussion of communication links here, we defer that discussion to Section 7.2 [on-line], which describes a three-layer model that organizes communication links into systems called networks. Figure 7.18 [on-line] illustrates this three-layer network model, which comprises a link layer, a network layer, and an end-to-end layer.
Computer systems use names in many ways in their construction, configuration, and operation. The previous section mentioned memory addresses, processor registers, and link names, and Figure 2.9 lists several additional examples, some of which are probably familiar, others of which will turn up in later chapters. Some system names resemble those of a programming language, whereas others are quite different. When building systems out of subsystems, it is essential to be able to use a subsystem without having to know details of how that subsystem refers to its components. Names are thus used to achieve modularity, and at the same time, modularity must sometimes hide names.
Figure 2.9 Examples of names used in systems.
We approach names from an object point of view: the computer system manipulates objects. An interpreter performs the manipulation under control of a program or perhaps under the direction of a human user. An object may be structured, which means that it uses other objects as components. In a direct analogy with two ways in which procedures can pass arguments, there are two ways to arrange for one object to use another as a component:
create a copy of the component object and include the copy in the using object (use by value), or
choose a name for the component object and include just that name in the using object (use by reference). The component object is said to export the name.
When passing arguments to procedures, use by value enhances modularity because if the callee accidentally modifies the argument, it does not affect the original. But use by value can be problematic because it does not easily permit two or more objects to share a component object whose value changes. If objects A and B both use object C by value, then changing the value of C is a concept that is either meaningless or difficult to implement: it could require tracking down the two copies of C included in A and B in order to update them. Similarly, in procedure calls it is sometimes useful to give the callee the ability to modify the original object, so most programming languages provide some way to pass the name (the pseudocode in this text uses the reference declaration for that purpose) rather than the value. One purpose of names, then, is to allow use by reference and thus simplify the sharing of changeable objects.
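In Python terms, the distinction looks like the following sketch (Python passes references to objects, so use by value must be simulated with an explicit copy):

    import copy

    component = {"color": "red"}              # the shared component object C

    a = {"part": copy.deepcopy(component)}    # use by value: A includes its own copy
    b = {"part": component}                   # use by reference: B includes just a name

    component["color"] = "blue"               # the component's value changes
    print(a["part"]["color"])                 # red  -- A's copy is unaffected
    print(b["part"]["color"])                 # blue -- B sees the change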
Sharing illustrates one fundamental purpose for names: as a communication and an organizing tool. Because two uses of the same name can refer to the same object, whether those uses are by different users or by the same user at different times, names are invaluable both for communication and for organization of things so that one can find them later.
A second fundamental purpose for a name is to allow a system designer to defer to a later time an important decision: to which object should this name refer? A name also makes it easy to change that decision later. For example, an application program may refer to a table of data by name. There may be several versions of that table, and the decision about which version to use can wait until the program actually needs the table.
Decoupling one object from another by using a name as an intermediary is known as indirection. Deciding on the correspondence between a name and an object is an example of binding. Changing a binding is a mechanically easy way to replace one object with another. Modules are objects, so naming is a cornerstone of modularity.
This section introduces a general model for the use of names in computer systems. Some parts of this model should be familiar; the discussion of the three fundamental abstractions in the previous section introduced names and some naming terminology. The model is only one part of the story. Chapter 3 discusses in more depth the many decisions that arise in the design of naming schemes.
It is helpful to have a model of how names are associated with specific objects. A system designer creates a naming scheme, which consists of three elements. The first element is a name space, which comprises an alphabet of symbols together with syntax rules that specify which names are acceptable. The second element is a name-mapping algorithm, which associates some (not necessarily all) names of the name space with some (again, not necessarily all) values in a universe of values, which is the third and final element of the naming scheme. A value may be an object, or it may be another name from either the original name space or from a different name space. A name-to-value mapping is an example of a binding, and when such a mapping exists, the name is said to be bound to the value. Figure 2.10 illustrates.
Figure 2.10 General model of the operation of a naming scheme. The name-mapping algorithm takes in a name and a context, and it returns an element from the universe of values. The arrows indicate that, using context “A”, the algorithm resolves the name “N4” to the value “13”.
In most systems, typically several distinct naming schemes are in operation simultaneously. For example, a system may be using one naming scheme for e-mail mailbox names, a second naming scheme for Internet hosts, a third for files, and a fourth for virtual memory addresses. When a program interpreter encounters a name, it must know which naming scheme to invoke. The environment surrounding use of the name usually provides enough information to identify the naming scheme. For example, in an application program, the author of that program knows that the program should expect file names to be interpreted only by the file system and Internet host names to be interpreted only by some network service.
The interpreter that encounters the name runs the name-mapping algorithm of the appropriate naming scheme. The name-mapping algorithm resolves the name, which means that it discovers and returns the associated value (for this reason, the name-mapping algorithm is also called a resolver). The name-mapping algorithm is usually controlled by an additional parameter, known as a context. For a given naming scheme, there can be many different contexts, and a single name of the name space may map to different values when the resolver uses different contexts. For example, in ordinary discourse when a person refers to the names “you”, “here”, or “Alice”, the meaning of each of those names depends on the context in which the person utters it. On the other hand, some naming schemes have only one context. Such naming schemes provide what are called universal name spaces, and they have the nice property that a name always has the same meaning within that naming scheme, no matter who uses it. For example, in the United States, social security numbers, which identify government pension and tax accounts, constitute a universal name space. When there is more than one context, the interpreter may tell the resolver which one it should use or the resolver may use a default context.
We can summarize the naming model by defining the following conceptual operation on names:
value ← RESOLVE (name, context)
When an interpreter encounters a name in an object, it first figures out what naming scheme is involved and thus which version of RESOLVE it should invoke. It then identifies an appropriate context, resolves the name in that context, and replaces the name with the resolved value as it continues interpretation. The variable context tells RESOLVE which context to use. That variable contains a name known as a context reference.
In a processor, register numbers are names. In a simple processor, the set of register names, and the registers those names are bound to, are both fixed at design time. In most other systems that use names (including the register naming scheme of some high-performance processors), it is possible to create new bindings and delete old ones, enumerate the name space to obtain a list of existing bindings, and compare two names. For these purposes we define four more conceptual operations:
status ← BIND (name, value, context)
status ← UNBIND (name, context)
list ← ENUMERATE (context)
result ← COMPARE (name1, name2)
The first operation changes the context by adding a new binding; the status result reports whether or not the change succeeded (it might fail if the proposed name violates the syntax rules of the name space). After a successful call to BIND, RESOLVE will return the new value for name. The second operation, UNBIND, removes an existing binding from context, with status again reporting success or failure (perhaps because there was no such existing binding). After a successful call to UNBIND, RESOLVE will no longer return that value for name.

The BIND and UNBIND operations allow the use of names to make connections between objects and to change those connections later. A designer of an object can, by using a name to refer to a component object, choose the object to which that name is bound either then or at a later time by invoking BIND, and eliminate a binding that is no longer appropriate by invoking UNBIND, all without modifying the object that uses the name. This ability to delay and change bindings is a powerful tool used in the design of nearly all systems.

Some naming implementations provide an ENUMERATE operation, which returns a list of all the names that can be resolved in context. Some implementations of ENUMERATE can also return a list of all the values currently bound in context. Finally, the COMPARE operation reports (TRUE or FALSE) whether or not name1 is the same as name2. The meaning of “same” is an interesting question addressed in Section 2.2.5, and it may require supplying additional context arguments.
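A minimal sketch of these conceptual operations, representing a context as a Python dictionary; this anticipates the table-lookup implementation described below, and the status values and handling of missing bindings are assumptions:

    class Context:
        def __init__(self):
            self.bindings = {}               # {name: value} pairs

    def resolve(name, context):
        return context.bindings.get(name, "not-found")

    def bind(name, value, context):
        context.bindings[name] = value       # add a new binding
        return "ok"

    def unbind(name, context):
        if name not in context.bindings:
            return "no-such-binding"
        del context.bindings[name]
        return "ok"

    def enumerate_names(context):
        return list(context.bindings)

    def compare(name1, name2):
        return name1 == name2                # sameness of names, not of values

    a = Context()
    bind("N4", 13, a)
    print(resolve("N4", a))                  # 13, as in Figure 2.10
    unbind("N4", a)
    print(resolve("N4", a))                  # not-found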
Different naming schemes have different rules about the uniqueness of name-to-value mappings. Some naming schemes have a rule that a name must map to exactly one value in a given context and a value must have only one name, while in other naming schemes one name may map to several values, or one value may have several names, even in the same context. Another kind of uniqueness rule is that of a unique identifier name space, which provides a set of names that will never be reused for the lifetime of the name space and, once bound, will always remain bound to the same value. Such a name is said to have a stable binding. If a unique identifier name space also has the rule that a value can have only one name, the unique names become useful for keeping track of objects over a long period of time, for comparing references to see if they are to the same object, and for coordination of multiple copies in systems where objects are replicated for performance or reliability. For example, the customer account number of most billing systems constitutes a unique identifier name space. The account number will always refer to the same customer’s account as long as that account exists, despite changes in the customer’s address, telephone number, or even personal name. If a customer’s account is deleted, that customer’s account number will not someday be reused for a different customer’s account. Named fields within the account, such as the balance due, may change from time to time, but the binding between the customer account number and the account itself is stable.
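The account-number example can be sketched as follows; the counter-based allocator and the record fields are illustrative assumptions:

    class AccountNumberSpace:
        # a unique identifier name space: names are never reused, and once
        # bound, a name stays bound to the same account (a stable binding)
        def __init__(self):
            self.next_number = 1001
            self.accounts = {}

        def create_account(self, record):
            number = self.next_number       # never handed out twice
            self.next_number += 1
            self.accounts[number] = record
            return number

        def resolve(self, number):
            return self.accounts.get(number)

    billing = AccountNumberSpace()
    n = billing.create_account({"name": "Alice", "balance_due": 0})
    billing.resolve(n)["balance_due"] = 19  # fields may change from time to time,
    print(n, billing.resolve(n))            # but the number names the same account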
The name-mapping algorithm plus a single context do not necessarily map all names of the name space to values. Thus, a possible outcome of performing RESOLVE can be a not-found result, which RESOLVE may communicate to the caller either as a reserved value or as an exception. On the other hand, if the naming scheme allows one name to map to several values, a possible outcome can be a list of values. In that case, the UNBIND operation may require an additional argument that specifies which value to unbind. Finally, some naming schemes provide reverse lookup, which means that a caller can supply a value as an argument to the name-mapping algorithm, and find out what name or names are bound to that value.
Figure 2.10 illustrates the naming model, showing a name space, the corresponding universe of values, a name-mapping algorithm, and a context that controls the name-mapping algorithm.
In practice, one encounters three frequently used name-mapping algorithms:
The most common implementation of a context is a table of {name, value} pairs. When the implementation of a context is a table, the name-mapping algorithm is just a lookup of the name in that table. The table itself may be complex, involving hashing or B-trees, but the basic idea is still the same. Binding a new name to a value consists of adding that {name, value} pair to the table. Figure 2.11 illustrates this common implementation of the naming model. There is one such table for each context, and different contexts may contain different bindings for the same name.
Figure 2.11 A system that uses table lookup as the name-mapping algorithm. As in the example of Figure 2.10, this system also resolves the name “N4” to the value “13”.
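With dictionaries as table-lookup contexts, the telephone-book example in the list below might be sketched like this (the personal name and the numbers are invented):

    boston = {"A. Jones": "617-555-0143"}            # one context of the naming scheme
    san_francisco = {"A. Jones": "415-555-0187"}     # same name, different binding

    def resolve(name, context):
        return context.get(name, "not-found")        # name mapping is just a lookup

    print(resolve("A. Jones", boston))               # 617-555-0143
    print(resolve("A. Jones", san_francisco))        # 415-555-0187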
Real-world examples of both the general naming model and the table-lookup implementation abound:
1. A telephone book is a table-lookup context that binds names of people and organizations to telephone numbers. As in the data communication network example, telephone numbers are themselves names that the telephone company resolves into physical line appearances, using a name-mapping algorithm that involves area codes, exchanges, and physical switchgear. The telephone books for Boston and for San Francisco are two contexts of the same naming scheme; any particular name may appear in both telephone books, but if so, it is probably bound to different telephone numbers.
2. Small integers name the registers of a processor. The value is the register itself, and the mapping from name to value is accomplished by wiring.
3. Memory cells are similarly named with the numbers called addresses, and the name-to-value mapping is again accomplished by wiring. Chapter 5 describes an address-renaming mechanism known as virtual memory, which binds blocks of virtual addresses to blocks of contiguous memory cells. When a system implements multiple virtual memories, each virtual memory is a distinct context; a given address can refer to a different memory cell in each virtual memory. Memory cells can also be shared among virtual memories, in which case the same memory cell may have the same (or different) addresses in different virtual memories, as determined by the bindings.
4. A typical computer file system uses several layers of names and contexts: disk sectors, disk partitions, files, and directories are all named objects. Directories are examples of table-lookup contexts. A particular file name may appear in several different directories, bound to either the same or different files. Section 2.5 presents a case study of naming in the UNIX file system.
5. Computers connect to data communication networks at places known as network attachment points. Network attachment points are usually named with two distinct naming schemes. The first one, used inside the network, involves a name space consisting of numbers in a fixed-length field. These names are bound, sometimes permanently and sometimes only briefly, to physical entrance and exit points of the network. A second naming scheme, used by clients of the network, maps a more user-friendly universal name space of character strings to names of the first name space. Section 4.4 is a case study of the Domain Name System, which provides user-friendly attachment point naming for the Internet.
6. A programmer identifies procedure variables by names, and each activation of the procedure provides a distinct context in which most such names are resolved. Some names, identified as “static” or “global names”, may instead be resolved in a context that is shared among activations or among different procedures. When a procedure is compiled, some of the original user-friendly names of variables may be replaced with integer identifiers that are more convenient for a machine to manipulate, but the naming model still holds.
7. A Uniform Resource Locator (URL) of the World Wide Web is mapped to a specific Web page by a relatively complicated algorithm that breaks the URL up into several constituent parts and resolves the parts using different naming schemes; the result eventually identifies a particular Web page. Section 3.2 is a case study of this naming scheme.
8. A customer billing system typically maintains at least two kinds of names for each customer account. The account number names the account in a unique identifier name space, but there is also a distinct name space of personal names that can also be used to identify the account. Both of these names are typically mapped to account records by a database system, so that accounts can be retrieved either by account number or by personal name.
These examples also highlight a distinction between “naming” and binding. Some, but not all, contexts “name” things, in the sense that they map a name to an object that is commonly thought of as having that name. Thus, the telephone directory does not “name” either people or telephone lines. Somewhere else there are contexts that bind names to people and that bind telephone numbers to particular physical phones. The telephone directory binds the names of people to the names of telephones.
For each of these examples a context reference must identify the context in which the name-mapping algorithm should resolve the name. Next, we explore where context references come from.
When a program interpreter encounters a name in an object, someone must supply a context reference so that the name-mapping algorithm can know which context it should use to resolve the name. Many apparently puzzling problems in naming can be simply diagnosed: the name-mapping algorithm, for whatever reason, used the wrong context reference.
There are two ways to come up with a context with which to resolve the names found in an object: default and explicit. A default context reference is one that the resolver supplies, whereas an explicit context reference is one that comes packaged with the name-using object. Sometimes a naming scheme allows for use of both explicit and default methods: it uses an explicit context reference if the object or name provides one; if not, it uses a default context. Figure 2.12 outlines the taxonomy of context references described in the next two paragraphs.
Figure 2.12 Taxonomy of context references.
A default context reference can be a constant that is built in to the resolver as part of its design. Since a constant allows for just one context, the resulting name space is universal. Alternatively, a default context reference can be a variable that the resolver obtains from its current execution environment. That variable may be set by some context assignment rule. For example, in most multiple-user systems, each user’s execution environment contains a state variable called the working directory. The working directory acts as the default context for resolving file names. Similarly, the system may assign a default context for each distinct activity of a user or even, as will be seen in Chapter 3 (Figures 3.2 and 3.3), for each major subsystem of a system.
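A sketch of this default rule for file names, assuming a UNIX-style working directory (the directory names are invented):

    import os.path

    working_directory = "/home/alice"   # default context reference, part of the
                                        # user's execution environment

    def resolve_file_name(name):
        if os.path.isabs(name):         # the name carries an explicit context reference
            return name
        return os.path.join(working_directory, name)   # fall back to the default

    print(resolve_file_name("notes.txt"))        # /home/alice/notes.txt
    print(resolve_file_name("/usr/bin/emacs"))   # /usr/bin/emacs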
In contrast, an explicit context reference commonly comes in one of two forms: a single-context reference intended to be used for all the names that an object uses, or a distinct context reference associated with each name in the object. The second form, in which each name is packaged with its own context reference, is known as a qualified name.
A context reference is itself a name (it names the context), which leads some writers to describe it as a base name. The name resolver must thus resolve the name represented by the context reference before it can proceed with the original name resolution. This recursion may be repeated several times, but it must terminate somewhere, with the invocation of a name resolver that has a single built-in context. This built-in context contains the bindings that permit the recursion to be unraveled.
That description is quite abstract. To make it concrete, let’s revisit the previous real-world examples of names, in each case looking for the context reference the resolver uses:
1. When looking up a number in a telephone book, you must provide the context reference: you need to know whether to pick up the Boston or the San Francisco telephone book. If you call Directory Assistance to ask for a number, the operator will immediately ask you for the context reference by saying, “What city, please?” If you got the name from a personal letter, that letter may mention the city—an example of an explicit context reference. If not, you may have to guess, or undertake a search of the directories of several different cities.
2. In a processor, there is usually only one set of numbered registers; they comprise a default context that is built-in using wires. Some processors have multiple register sets, in which case there is an additional register, usually hidden from the application programmer, that determines which register set is currently in use. The processor uses the contents of that register, which is a component of the current interpretation environment, as a default context reference. It resolves that number with a built-in context that binds register set numbers to physical register sets by interpreting the register set number as an address that locates the correct bank of registers.
3. In a system that implements multiple virtual memories, the interpretation environment includes a processor register (the page-map address register of Chapter 5) that names the currently active page table; that register contains a reference to the default context. Some virtual memory systems provide a feature known as segments. In those systems, a program may issue addresses that contain an explicit context reference known as a segment number. Segments are discussed in Section 5.4.5.
4. In a file system with many directories, when a program refers to a file using an unqualified or incompletely qualified file name, the file system uses the working directory as a default context reference. Alternatively, a program may use an absolute path name, an example of a fully qualified name that we will discuss in depth in just a moment. The path name contains its own explicit context reference. In both the working directory and the absolute path name, the context reference is itself a name that the resolver must resolve before it can proceed with the original name resolution. This need leads to recursive name resolution, which is discussed in Section 2.2.3.
5. In the Internet, names of network attachment points may be qualified (e.g., ginger.pedantic.edu) or unqualified (e.g., ginger). When the network name resolver encounters an unqualified name, it qualifies that name with a default context reference, sometimes called the default domain. However it materializes, a qualified name is an absolute path name that still needs to be resolved. A different default—usually a configuration parameter of the name resolver—supplies the context for resolution of that absolute path name in the universal name space of Internet domain names. Section 4.4 describes in detail the rather elaborate mechanism that resolves Internet domain names.
6. The programming language community uses its own terminology to describe default and explicit context references. When implementing dynamic scope, the resolver uses the current naming environment as a default context for resolving names. When implementing static (also called lexical) scope, the creator of an object (usually a procedure object) associates the object with an explicit context reference—the naming environment at that instant. The language community calls this combination of an object and its context reference a closure. (A sketch of a closure in Python appears just after this list.)
7. For resolution of a URL for the World Wide Web, the name resolver is distributed, and different contexts are used for different components of the URL. Section 3.2 provides details.
8. Database systems provide the contexts for resolution of both account numbers and personal names in a billing system. If the billing system has a graphical user interface, it may offer a lookup form with blank fields for both account number and personal name. A customer service representative chooses the context reference by typing in one of the two fields and hitting a “find” button, which invokes the resolver. Each of the fields corresponds to a different context.
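Python itself is statically scoped, so the sixth example can be made concrete: the sketch below builds a closure, a procedure object packaged with the naming environment in which it was created.

    def make_counter():
        count = 0                      # part of the naming environment at creation
        def increment():
            nonlocal count             # resolved with static (lexical) scope
            count += 1
            return count
        return increment               # a closure: the procedure plus its context

    counter = make_counter()
    print(counter(), counter())        # 1 2 -- the captured environment persists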
A context reference can be dynamic, meaning that it changes from time to time. An example is when the user clicks on a menu button labeled “Help”. Although the button may always appear in the same place on the screen, the context in which the name “Help” is resolved (and thus the particular help screen that appears in response) is likely to depend on which application program, or even which part of that program, is running at the instant that the user clicks on the button.
A common problem is that the object that uses a name does not provide an explicit context, and the name resolver chooses the wrong default context. For example, a file system typically resolves a file name relative to a current working directory, even though this working directory may be unrelated to the identity of the program or data object making the reference. Compared with the name resolution environment of a programming system, most file systems provide a rather primitive name resolution mechanism.
An electronic mail system provides an example of the problem of making sure that names are interpreted in the intended context. Consider the e-mail message of Figure 2.13, which originated at Pedantic University. In this message, Alice, Bob, and Dawn are names from the local name space of e-mailboxes at Pedantic University, and Charles@cse.Scholarly.edu is a qualified name of an e-mailbox managed by a mail service named cse.Scholarly.edu at the Institute of Scholarly Studies. The name Charles is of a particular mailbox at that mail service, and the @-sign is conventionally used to separate the name of the mailbox from the name of the mail service.
Figure 2.13 An e-mail message that uses default contexts.
As it stands, if user Charles tries to reply to the sender of this message, the response will be addressed to Bob. Since the first name resolver to encounter the reply message is probably inside the system named cse.Scholarly.edu, that resolver would in the normal course of events use as a default context reference the name of the local mail service. That is, it would try to send the message to Bob@cse.Scholarly.edu. That isn’t the mailbox address of the user who sent the original message. Worse, it might be someone else’s mailbox address.
When constructing the e-mail message, Alice intended local names such as Bob to be resolved in her own context. Most mail sending systems know that a local name is not useful to anyone outside the local context, so it is conventional for the mail system to tinker with unqualified names found in the address fields by automatically rewriting them as qualified names, thus adding an explicit context reference to the names Bob and Alice, as shown in Figure 2.14.
Figure 2.14 The e-mail message of Figure 2.13 after the mail system expands every unqualified address in the headers to include an explicit context reference.
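A sketch of the rewriting step; the service name Pedantic.edu and the header format are assumptions for illustration, since the figure itself is not reproduced here:

    local_mail_service = "Pedantic.edu"          # the sender's local context

    def qualify(address):
        # rewrite an unqualified mailbox name as a qualified one;
        # leave already-qualified addresses alone
        if "@" in address:
            return address
        return address + "@" + local_mail_service

    for field in ["Alice", "Bob", "Charles@cse.Scholarly.edu"]:
        print(qualify(field))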
Unfortunately, the mail system can perform this address rewriting only for the headers because that is the only part of the message format it fully understands. If an e-mail address is embedded in the text of the message (as in the example, the mailbox name Dawn), the mail system has no way to distinguish it from the other text. If the recipient of the message wishes to make use of an e-mail address found in the text of the message, that recipient is going to have to figure out what context reference is needed. Sometimes it is easy to figure out what to do, but if a message has been forwarded a few times, or the recipient is unaware of the problem, a mistake is likely.
A partial solution could be to tag the e-mail message with an explicit context reference, using an extra header, as in Figure 2.15. With this addition, a recipient of this message could select either Alice in the header or Dawn in the text and ask the mail system to send a reply. The mail system could, by examining the Context: header, determine how to resolve any unqualified e-mail address associated with this message, whether found in the original headers or extracted from the text of the message. This scheme is quite ad hoc; if user Bob forwards the message of Figure 2.15 with an added note to someone in yet another naming context, any unqualified addresses in the added note would need a different explicit context reference. Although this scheme is not actually used in any e-mail system that the authors are aware of, it has been used in other naming systems. An example is the base element of HTML, the display language of the World Wide Web, described briefly in Section 3.2.2.
Figure 2.15 An e-mail message that provides an explicit context reference as one of its headers.
A closely related problem is that different contexts may bind different names for the same object. For example, to call a certain telephone, it may be that a person in the same organization dials 2-7104, a second person across the city dials 312-7104, a third who is a little farther away dials (517) 312-7104, and a person in another country may have to dial 001 (517) 312-7104. When the same object has different names in different contexts, passing a name from one user to another is awkward because, as with the e-mail message example, someone must translate the name before the other user can use it. As with the e-mail address, if someone hands you a scrap of paper on which is written the telephone number 312-7104, simply dialing that number may or may not ring the intended telephone. Even though the several names are related, some effort may be required to figure out just what translation is required.
The second of the three common name-mapping algorithms listed on page 64 is recursive name resolution. A path name can be thought of as a name that explicitly includes a reference to the context in which it should be resolved. In some naming schemes, path names are written with the context reference first, in others with the context reference last. Some examples of path names are:
ginger.pedantic.edu.
/usr/bin/emacs
Macintosh HD:projects:CSE 496:problem set 1
Chapter 2, section 2, part 3, first paragraph
Paragraph 1 of part 3 of section 2 of chapter 2
As these examples suggest, a path name involves multiple components and some syntax that permits a name resolver to parse the components. The last two examples illustrate that different naming schemes place the component names in opposite orders, and indeed the other examples also demonstrate both orders. The order of the components must be known to the user of the name and to the name resolver, but either way interpretation of the path name is most easily explained recursively by borrowing terminology from the representation of numbers: all but the least significant component of a path name is an explicit context reference that identifies the context to be used to resolve that least significant component. In the above examples, the least significant components and their explicit context references are, respectively,
Least significant component     Explicit context reference
ginger                          pedantic.edu.
emacs                           /usr/bin
problem set 1                   Macintosh HD:projects:CSE 496
first paragraph                 Chapter 2, section 2, part 3
Paragraph 1                     part 3 of section 2 of chapter 2
The recursive aspect of this description is that the explicit context reference is itself a path name that must be resolved. So we repeat the analysis as many times as needed until what was originally the most significant component of the path name is also the least significant component, at which point the resolver can do an ordinary table lookup using some context. In the choice of this context, the previous discussion of default and explicit context references again applies. In a typical design, the resolver uses one of two default context references:
A special context reference, known as the root, that is built in to the resolver. The root is an example of a universal name space. A path name that the resolver can resolve with recursion that ends at the root context is known as an absolute path name.
The path name of yet another default context. To avoid circularity, this path name must be an absolute path name. A path name that is resolved by looking up its most significant component in yet another context is known as a relative path name. (In a file system, the path name of this default context is what example 4 on page 68 identified as the working directory.) Thus in the UNIX file system, for example, if the working directory is /usr/Alice, the relative path name plans/Monday would resolve to the same file as the absolute path name /usr/Alice/plans/Monday.
If a single name resolver is prepared to resolve both relative and absolute path names, some scheme such as a syntactic flag (e.g., the initial “/” in /usr/bin/emacs and the terminal “.” in ginger.pedantic.edu.) may distinguish one from the other, or perhaps the name resolver will try both ways in some order, using the first one that seems to work. Trying two schemes in order is a simple form of multiple name lookup, about which we will have more to say in the next subsection.
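To make the recursion concrete, here is a minimal Python sketch, under the assumptions that each context is a simple name-to-object table (a dict), that the root is wired into the resolver as a global, and that a leading "/" is the syntactic flag marking an absolute path name:

    # The root context, wired into the resolver.
    root = {"usr": {"Alice": {"plans": {"Monday": "<a file>"}},
                    "bin": {"emacs": "<a binary>"}}}

    def resolve(path, working_dir="/"):
        if path.startswith("/"):
            context = root                  # absolute: recursion ends at the root
        else:
            context = resolve(working_dir)  # relative: first resolve the
                                            # (absolute) default context
        for component in [c for c in path.split("/") if c]:
            context = context[component]    # one ordinary table lookup
        return context

The loop is simply the recursion of the preceding paragraphs unrolled: everything but the least significant component acts as an explicit context reference that gets resolved before the final lookup. With the working directory set to /usr/Alice, resolve("plans/Monday", "/usr/Alice") and resolve("/usr/Alice/plans/Monday") return the same object.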
Path names can also be thought of as identifying objects that are organized in what is called a naming network. In a naming network, contexts are treated as objects, and any context may contain a name-to-object binding for any other object, including another context. The name resolver somehow chooses one context to use as the root (perhaps by having a lower-level name for that context wired into the resolver), and it then resolves all absolute path names by tracing a path from the chosen root to the first named context in the path name, then the next, continuing until it reaches the object that was named by the original path name. It similarly resolves relative path names starting with a default context found in a variable in its environment. That variable contains the absolute path name of the default context. Since there can be many paths from one place to another, there can be many different path names for the same object or context. Multiple names for the same object are known as synonyms or aliases. (This text avoids the word “alias” because different systems use it in quite different ways.) On the other hand, since the root provides a universal name space, everyone who uses the same absolute path name is referring to the same object.
Sharing names of a naming network can be a problem because each user may express path names relative to a different starting point. As a result, it may not be obvious how to translate a path name when passing it from one user to another. One standard solution to this problem is to require that users share only absolute path names, all of which begin with the root.
The file system of a computer operating system is usually organized as a naming network, with directories acting as contexts. It is common in file systems to encounter implementation-driven restrictions on the shape of the naming network, for example, requiring that the contexts be organized in a naming hierarchy with the root acting as the base of the tree. A true naming hierarchy is so constraining that it is rarely found in practice; real systems, even if superficially hierarchical, usually provide some way of adding cross-hierarchy links. The simplest kind of link is just a synonym: a single object may be bound in more than one context. Some systems allow a more sophisticated kind of link, known as an indirect name. An indirect name is one that a context binds to another name in the same name space rather than to an object. Because many designers have independently realized that indirect names are useful, they have come to be called by many different labels, including symbolic link, soft link, alias, and shortcut. The UNIX file system described in Section 2.5 includes a naming hierarchy, links, and indirect names called soft links.
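A sketch of how indirect names might be layered onto the resolver of the previous sketch, assuming a context entry may be tagged as binding a name to another path name rather than to an object (the class name SoftLink is our own invention):

    class SoftLink:
        # An indirect name: bound to another name in the same name
        # space, not to an object.
        def __init__(self, target):
            self.target = target

    def lookup(context, name, max_links=8):
        value = context[name]
        while isinstance(value, SoftLink):
            if max_links == 0:              # guard against cyclic links
                raise RuntimeError("too many levels of indirect names")
            max_links -= 1
            value = resolve(value.target)   # re-resolve the indirect name
        return value

Real file systems impose a similar limit on the number of soft links followed during one resolution, since nothing prevents an indirect name from pointing, directly or through other links, back at itself.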
A path name has internal structure, so a naming scheme that supports path names usually has rules regarding construction of allowable path names. Path names may have a maximum length, and certain symbols may be restricted for use only as structural separators.
Returning to the topic of default contexts (in the taxonomy of Figure 2.12), context assignment rules are a blunt tool. For example, a directory containing library programs may need to be shared among different users; no single assignment rule can suffice. This inflexibility leads to the third, more elaborate name resolution scheme, multiple lookup.* The idea of multiple lookup is to abandon the notion of a single, default context and instead resolve the name by systematically trying several different contexts. Since a name may be bound in more than one context, multiple lookup can produce multiple resolutions, so some scheme is needed to decide which resolution to use.
A common such scheme is called the search path, which is nothing more than a specific list of contexts to be tried, in order. The name resolver tries to resolve the name using the first context in the list. If it gets a not-found result, it tries the next context, and so on. If the name is bound in more than one of the listed contexts, the one earliest in the list wins and the resolver returns the value associated with that binding.
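In Python the whole mechanism takes only a few lines; the sketch below again assumes each context is a dict, and borrows the SQRT library example of the next paragraph:

    def multiple_lookup(name, search_path):
        for context in search_path:
            try:
                return context[name]    # invoke the single-context resolver
            except KeyError:
                pass                    # not-found: try the next context
        raise KeyError(name)            # bound in none of the listed contexts

    personal_lib = {}                   # tried first: user's own programs
    math_lib = {"SQRT": "<library square-root procedure>"}
    multiple_lookup("SQRT", [personal_lib, math_lib])   # finds the library copy

Binding a name in personal_lib would shadow the library's binding, which is exactly the replace-by-name feature, and hazard, described in the paragraphs that follow.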
A search path is often used in programming systems that have libraries. Suppose, for example, a library procedure that calculates the square root math function exports a procedure interface named SQRT. After compiling this function, the writer places a copy of the binary program in a math library. A prospective user of the square root function writes the statement
x ← SQRT(y)
in a program, and the compiler generates code that uses the procedure named SQRT. The next step is that the compiler (or in some systems a later loader) undertakes a series of lookups in various public and private libraries that it knows about. Each library is a context, and the search path is a list of the library contexts. Each step of the multiple lookup involves an invocation of a simpler, single-context name resolver. Some of these attempted resolutions will probably return a not-found result. The first resolution attempt that finds a program named SQRT will return that program as the result of the lookup.
A search path is usually implemented as a per-user list, some or all of whose elements the user can set. By placing a library that contains personally supplied programs early in the search path, an individual user can effectively replace a library program with another that has the same name, thereby providing a user-dependent binding. This replace-by-name feature can be useful, but it can also be hazardous because one may unintentionally choose a name for a program that is also exported by some completely unrelated library program. When some other application tries to call that unrelated program, the ensuing multiple lookup may find the wrong one. As the number of libraries and names in the search path increases, the chance increases that two libraries will accidentally contain two unrelated programs that happen to export the same name.
Despite the hazards, search paths are a widely used mechanism. In addition to loaders using search paths to locate library procedures, user interfaces use search paths to locate commands whose names the user typed, compilers use search paths to locate interfaces, documentation systems use search paths to find cited documents, and word processing systems use search paths to locate text fragments to be included in the current document.
Some naming schemes use a more restricted multiple lookup method. For example, rather than allowing an arbitrary list of contexts, a naming scheme may require that contexts be arranged in nested layers. Whenever a resolution returns not-found in some layer, the resolver retries in the enclosing layer. Layered contexts were at one time popular in programming languages, where programs define and call on subprograms, because it can be convenient (to the point of being undisciplined, which is why it is no longer so popular) to allow a subprogram access by name to the variables of the defining or calling program. For another example, the scheme for numbering Internet network attachment points has an outer public layer and an inner private layer. Certain Internet address ranges (e.g., all addresses with a first byte of 10) are reserved for use in private networks; those address ranges constitute an inner private layer. These network addresses may be bound to different network attachment points in different private contexts without risk of conflict. Internet addresses that are outside the ranges reserved for private contexts should not be bound in any private context; they are instead resolved in the public context.
In a set of layered contexts, the scope of a name is the range of layers in which the name is bound to the same object. A name that is bound only in the outermost layer, and is always bound to the same object, independent of the current context layer, is known as a global name. The outermost layer that resolves global names is an example of a universal name space.
Incidentally, we have now used the term path as both an adjective qualifier and a noun, but with quite different meanings. A path name is a name that carries its own explicit context, while a search path is a context that consists of a list of contexts. Thus each element of a search path may be a path name.
The word “search” also has another, related but somewhat different, meaning. Internet search engines such as Google and AltaVista take as input a query consisting of one or more key words, and they return a list of World Wide Web pages that contain those key words. Multiple results (known as “hits”) are the common case, and Google, for example, implements a sophisticated system for ranking the hits. Google also offers the user the choice of receiving just the highest-ranked hit (“I’m feeling lucky”) or receiving a rank-ordered list of hits. Most modern desktop computer systems also provide some form of key word search for local files. When one encounters the unqualified word “search”, it is a good idea to pause and figure out whether it refers to multiple lookup or to key word query.
As mentioned earlier, one more operation is sometimes applied to names:
result ← COMPARE(name1, name2)
where result is a binary value, TRUE or FALSE. The meaning of name comparison requires some thought because the invoker might have one of three different questions in mind:
1. Are the two names the same?
2. Are the two names bound to the same value?
3. If the value or values are actually the identifiers of storage containers, such as memory cells or disk sectors, are the contents of the storage containers the same?
The first question is mechanically easiest to answer because it simply involves comparing the representations of the two names (“Is Jim Smith the same as Jim Smith?”), and it is exactly what a name resolver does when it looks things up in a table-lookup context: look through the context for a name that is the same as the one being resolved. On the other hand, in many situations the answer is not useful, since the same name may be bound to different values in different contexts and two different names may be synonyms that are bound to the same value. All one learns from the first question is whether or not the name strings have the same bit pattern.
For that reason, the answer to the second question is often more interesting. (“Is the Jim Smith who just received the Nobel prize the same Jim Smith I knew in high school?”) Getting that answer requires supplying the contexts for the two names as additional arguments to COMPARE, so that it can resolve the names and compare the results. Thus, for example, resolving the variable name A and the variable name B may reveal that they are both bound to the same storage cell address. Even this answer may still not reveal as much as expected because the two names may resolve to two names of a different, lower-layer naming scheme, in which case the same questions need to be asked recursively about the lower-layer names. For example, variable names A and B may be bound to different storage cell addresses, but if a virtual memory is in use those different virtual storage cell addresses might map to the same physical cell address. (This example will make more sense when we reach Chapter 5.)
Even after reaching the bottom of that recursion, the result may be the names of two different physical storage containers that contain identical copies of data, or it may be two different lower-layer names (that is, synonyms) for the same storage container. (“This biography file on Jim Smith is identical to that biography file on Jim Smith. Are there one or two biography files?” “This biography about Edwin Aldrin is identical to that biography about Buzz Aldrin. Are those two names for the same person?”) Thus the third question arises, along with a need to understand what it means to be the “same”. Unless one has some specific understanding of the underlying physical representation, the only way to distinguish the two cases may be to change the contents of one of the named storage containers and see if that causes the contents of the other one to change. (“Kick this one and see if that one squeals.”)
In practice, systems (and some programming languages) typically provide several COMPARE operators that have different semantics designed to help answer these different questions, and the programmer or user must understand which COMPARE operation is appropriate for the task at hand. For example, the LISP language provides three comparison operators, named EQ (which compares the bindings of its named arguments), EQU (which compares the values of its named arguments), and EQUALS (which recursively compares entire data structures).
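Python happens to offer one operator for each kind of question, which makes for a compact illustration (the mapping onto the three questions is our own):

    context = {"A": [1, 2, 3]}
    context["B"] = context["A"]     # a synonym: two names, one storage container
    context["C"] = [1, 2, 3]        # a different container with identical contents

    "A" == "B"                      # question 1, compare the names: False
    context["A"] is context["B"]    # question 2, same binding: True
    context["A"] is context["C"]    # question 2 again: False
    context["A"] == context["C"]    # question 3, same contents: True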
Underlying all name reference is a recursive protocol that answers the question, “How did you know to use this name?” This name discovery protocol informs an object’s prospective user of the name that the object exports. Name discovery involves two basic elements: the exporter advertises the existence of the name, while the prospective user searches for an appropriate advertisement. The thing that makes name discovery recursive is that the name user must first know the name of a place to search for the advertisement. This recursion must terminate somewhere, perhaps in a direct, outside-the-computer communication between some name user and some name exporter.
The simplest case is a programmer who writes a program consisting of two procedures, one of which refers to the other by name. Since the same programmer wrote both, name discovery is explicit and no recursion is necessary. Next, suppose the two programs are written by two different programmers. The programmer who wants to use a procedure by name must somehow discover the exported name. One possibility is that the second programmer performs the advertisement by shouting the procedure’s name down the hall. Another possibility is that the using programmer looks in a shared directory in which everyone agrees to place shared procedures. How does that programmer know the name of that shared directory? Perhaps someone shouted that name down the hall. Or perhaps it is a standard library directory whose name is listed in the programmers’ reference manual, in which case that manual terminates the recursive protocol. Although program library names don’t usually appear in magazine advertisements or on billboards, it has become commonplace to discover the name of a World Wide Web site in such places. Name discovery can take several forms:
Well-known name: A name (such as “Google” or “Yahoo!”) that has been advertised so widely that one can depend on it being stable for at least as long as the thing it names. Running across a well-known name is a method of name discovery.
Broadcast: A way of advertising a name, for example by wearing a badge that says “Hello, my name is … ”, posting the name on a bulletin board, or sending it to a mailing list. Broadcast is used by automatic configuration protocols sometimes called “plug-and-play” or “zero configuration”. It may even be used on a point-to-point communication link in the hope that there is someone listening at the other end who will reply. Listening for broadcasts is a method of name discovery.
Query (also called search): Present one or more key words to, for example, a search engine such as Google. Query is a widely used method of name discovery.
Broadcast query: A generalized form of key word query. Ask everyone within hearing distance “does anyone know a name for… ?” (sometimes confusingly called “reverse broadcast”).
Resolving a name of one name space to a name of a different name space: Looking up a name in the telephone book leads to discovery of a telephone number. The Internet Domain Name System, described in Section 4.4, performs a similar service, looking up a domain name and returning a network attachment point address.
Introduction: What happens at parties and in on-line dating services. Some entity that you already know knows a name and gives that name to you. In a computer system, a friend may send you an e-mail message that mentions the name of an interesting Web site or the e-mail address of another friend. For another example, each World Wide Web page typically contains introductions (technically known as hypertext links) to other Web pages.
Physical rendezvous: A meeting held outside the computer. It requires somehow making prior arrangements concerning time and place, which implies prior communication, which implies prior knowledge of some names. Once set up, physical rendezvous can be used for discovering other names as well as for verifying authenticity. Many organizations require that setting up a new account on a company computer system must involve a physical rendezvous with the system administrator to exchange names and choose a password.
Any of the above methods of name discovery may require first discovering some other name, such as the name of the reference source for well-known names, the name of the bulletin board on which broadcasts are placed, the name of the name resolver, the name of the party host, and so on. The method of discovering this other name may be the same as the method first invoked, or it may be different. The important thing the designer must keep in mind is that the recursion must terminate somewhere—it can’t be circular.
Some method of name discovery is required wherever a name is needed. An interesting exercise is to analyze some of the examples of names mentioned in earlier parts of this chapter, tracing the name discovery recursion to see how it terminates, because in many cases that termination is so distant from the event of name usage and resolution that it has long since been forgotten. Many additional examples of name discovery will show up in later chapters: names used for clients and services, where a client needs to discover the name of an appropriate service; data communication networks, where routing provides a particularly explicit example of name discovery; and security, where it is critical to establish the integrity of the terminating step.
Section 2.1 demonstrated how computer system designers use layers to implement more elaborate versions of the three fundamental abstractions, and Section 2.2 explained how names are used to connect system components. Designers also use layers and names in many other ways in computer systems. Figure 2.16 shows the typical organization of a computer system as three distinct layers. The bottom layer consists of hardware components, such as processors, memories, and communication links. The middle layer consists of a collection of software modules, called the operating system (see Sidebar 2.4), that abstract these hardware resources into a convenient application programming interface (API). The top layer consists of software that implements application-specific functions, such as a word processor, payroll program, computer game, or Web browser. If we examine each layer in detail, we are likely to find that it is itself organized in layers. For example, the hardware layer may comprise a lower layer of gates, flip-flops, and wires, and an upper layer of registers, memory cells, and finite-state machines.
Figure 2.16 A typical computer system organized in three layers. The operating system layer allows bypass, so the application layer can directly invoke many features of the hardware layer. However, the operating system layer hides certain dangerous features of the hardware layer.
Sidebar 2.4 What is an Operating System?
An operating system is a set of programs and libraries that make it easy for computer users and programmers to do their job. In the early days of computers, operating systems were simple programs that assisted operators of computers (at that time the only users who interacted with a computer directly), which is why they are called operating systems.
Today operating systems come in many flavors and differ in the functions they provide. The operating system for the simplest computers, such as that for a microwave oven, may comprise just a library that hides hardware details in order to make it easier for application programmers to develop applications. Personal computers, on the other hand, ship with operating systems that contain tens of millions of lines of code. These operating systems allow several people to use the same computer; permit users to control which information is shared and with whom; can run many programs at the same time while keeping them from interfering with one another; provide sophisticated user interfaces, Internet access, file systems, backup and archive applications, device drivers for the many possible hardware gadgets on a personal computer, and a wide range of abstractions to simplify the job of application programmers, and so on.
Operating systems also offer an interesting case study of system design. They are evolving rapidly because of new requirements. Their designers face a continuous struggle to control their complexity. Some modern operating systems have interfaces consisting of thousands of procedures, and their implementations are so complex that it is a challenge to make them work reliably.
This book has much more to say about operating systems, starting in Section 5.1.1, where it begins development of a minimal model operating system.
The exact division of labor between the hardware layer and the software layers is an engineering trade-off and a topic of considerable debate between hardware and software designers. In principle, every software module can be implemented in hardware. Similarly, most hardware modules can also be implemented in software, except for a few foundational components such as transistors and wires. It is surprisingly difficult to state a generic principle for how to decide between an implementation in hardware or software. Cost, performance, flexibility, convenience, and usage patterns are among the factors that are part of the trade-off, but for each individual function they may be weighted differently. Rather than trying to invent a principle, we discuss the trade-off between hardware and software in the context of specific functions as they come up.
The operating system layer usually exhibits an interesting phenomenon that we might call layer bypass. Rather than completely hiding the lower, hardware layer, an operating system usually hides only a few features of the hardware layer, such as particularly dangerous instructions. The remaining features of the hardware layer (in particular, most of the instruction repertoire of the underlying processor) pass through the operating system layer for use directly by the application layer, as in Figure 2.16. Thus, the dangerous instructions can be used only by the operating system layer, while all of the remaining instructions can be used by both the operating system and application layers. Conceptually, a designer could set things up so that the operating system layer intercepts every invocation of the hardware layer by the application layer and then explicitly invokes the hardware layer. That design would slow a heavily used interface down unacceptably, so in the usual implementation the application layer directly invokes the hardware layer, completely bypassing the operating system layer. Operating systems provide bypass for performance reasons, but bypass is not unique to operating systems, nor is it used only to gain performance. For example, the Internet is a layered communication system that permits bypass of most features of most of its layers, to achieve flexibility.
In this section we examine two examples of layered computer system organization: the hardware layer at the bottom of a typical computer system and one part of the operating system layer that creates the typical application programming interface known as the file system.
The hardware layer of a typical computer is constructed of modules that directly implement low-level versions of the three fundamental abstractions. In the example of Figure 2.17, the processor modules interpret programs, the random access memory modules store both programs and data, and the input/output (I/O) modules implement communication links to the world outside the computer.
Figure 2.17 A computer with several modules connected by a shared bus. The numbers are the bus addresses to which the attached module responds.
There may be several examples of each kind of hardware module—multiple processors (perhaps several on one chip, an organization that goes by the buzzword name multicore), multiple memories, and several kinds of I/O modules. On closer inspection the I/O modules turn out to be specialized interpreters that implement I/O programs. Thus, the disk controller is an interpreter of disk I/O programs. Among its duties are mapping disk addresses to track and sector numbers and moving data from the disk to the memory. The network controller is an interpreter that talks on its other side to one or more real communication links. The display controller interprets display lists that it finds in memory, lighting pixels on the display as it goes. The keyboard controller interprets keystrokes and places the result in memory. The clock may be nothing but a minuscule interpreter that continually updates a single register with the time of day.
The various modules plug into the shared bus, which is a highly specialized communication link used to SEND messages to other modules. There are numerous bus designs, but they have some common features. One such common feature is a set of wires* comprising address, data, and control lines that connect to a bus interface on each module. Because the bus is shared, a second common feature is a set of rules, called the bus arbitration protocol, for deciding which module may send or receive a message at any particular time. Some buses have an additional module, the bus arbiter, a circuit or a tiny interpreter that chooses which of several competing modules can use the bus. In other designs, bus arbitration is a function distributed among the bus interfaces. Just as there are many bus designs, there are also many bus arbitration protocols. A particularly influential example of a bus is the UNIBUS®, introduced in the 1970s by Digital Equipment Corporation. The modularity provided by a shared bus with a standard arbitration protocol helped to reshape the computer industry, as was described in Sidebar 1.5.
A third common feature of bus designs is that a bus is a broadcast link, which means that every module attached to the bus hears every message. Since most messages are actually intended for just one module, a field of the message called the bus address identifies the intended recipient. The bus interface of each module is configured to respond to a particular set of bus addresses. Each module examines the bus address field (which in a parallel bus is usually carried on a set of wires separate from the rest of the message) of every message and ignores any message not intended for it. The bus addresses thus define an address space. Figure 2.17 shows that the two processors might accept messages at bus addresses 101 and 102, respectively; the display controller at bus address 103; the disk controller at bus addresses 104 and 105 (using two addresses makes it convenient to distinguish requests for its two disks); the network at bus address 106; the keyboard at bus address 107; and the clock at bus address 109. For speed, memory modules typically are configured with a range of bus addresses, one bus address per memory address. Thus, if in Figure 2.17 the two memory modules each implement an address space of 1,024 memory addresses, they might be configured with bus addresses 1024–2047 and 3072–4095, respectively.*
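Because the ranges are powers of two aligned on power-of-two boundaries, the address check performed by a bus interface reduces to a comparison of high-order bits. A hypothetical Python rendering of the check for memory module #2 of Figure 2.17 (bus addresses 3072–4095):

    def accepts(bus_address, base=3072, size=1024):
        # size is a power of two, so ~(size - 1) masks off the low-order
        # bits, keeping the high-order bits that identify the module
        return (bus_address & ~(size - 1)) == base

    accepts(3072)    # True: the module's first address
    accepts(4095)    # True: the module's last address
    accepts(1742)    # False: that address belongs to memory module #1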
Any bus module that wishes to send a message over the bus must know a bus address that the intended recipient is configured to accept. Name discovery in some buses is quite simple: whoever sets up the system explicitly configures the knowledge of bus addresses into the processor software, and that software passes this knowledge along to other modules in messages it sends over the bus. Other bus designs dynamically assign bus addresses to modules as they are plugged in to the bus and announce their presence.
A common bus design is known as split-transaction. In this design, when one module wants to communicate with another, the first module uses the bus arbitration protocol on the control wires to request exclusive use of the bus for a message. Once it has that exclusive use, the module places a bus address of the destination module on the address wires and the remainder of the message on the data wires. Assuming a design in which the bus and the modules attached to it run on uncoordinated clocks (that is, they are asynchronous), it then signals on one of the control wires (called READY) to alert the other modules that there is a message on the bus. When the receiving module notices that one of its addresses is on the address lines of the bus, it copies that address and the rest of the message on the data wires into its local registers and signals on another control line (called ACKNOWLEDGE) to tell the sender that it is safe to release the bus so that other modules can use it. (If the bus and the modules are all running with a common clock, the READY and ACKNOWLEDGE lines are not needed; instead, each module checks the address lines on each clock cycle.) Then, the receiver inspects the address and message and performs the requested operation, which may involve sending one or more messages back to the original requesting module or, in some cases, even to other modules.
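The following Python sketch walks through one such transaction, using the bus addresses of Figure 2.17 and anticipating the LOAD example that follows; it models the wires as shared variables and compresses arbitration and timing into comments, so it illustrates the message flow rather than any real bus:

    bus = {"address": None, "data": None}

    def bus_send(address, data):
        # (the arbitration protocol runs here; the winner drives the wires)
        bus["address"], bus["data"] = address, data
        # (READY is signaled; the addressed module copies the message into
        #  its registers and signals ACKNOWLEDGE; the sender then releases
        #  the bus for use by other modules)
        return bus["address"], bus["data"]

    # processor #2 asks for the contents of address 1742, to be
    # returned to its own bus address, 102
    bus_send(1742, ("READ", 102))
    # memory #1, having performed the read internally, later acquires
    # the bus itself and sends the result back
    bus_send(102, ("<value>",))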
For example, suppose that processor #2, while interpreting a running application program, encounters the instruction
LOAD 1742, R1
which means “load the contents of memory address 1742 into processor register R1”. In the simplest scheme, the processor just translates addresses it finds in instructions directly to bus addresses without change. It thus sends this message across the bus:
processor #2 ⇒ all bus modules: {1742, READ, 102}
The message contains three fields. The first message field (1742) is one of the bus addresses to which memory #1 responds; the second message field requests the recipient to perform a READ operation; and the third indicates that the recipient should send the resulting value back across the bus, using the bus address 102. The memory addresses recognized by each memory module are based on powers of two, so the memory modules can recognize all of the addresses in their own range by examining just a few high-order address bits. In this case, the bus address is within the range recognized by memory module 1, so that module responds by copying the message into its own registers. It acknowledges the request, the processor releases the bus, and the memory module then performs the internal operation
value ← READ (1742)
With value in hand, the memory module now itself acquires the bus and sends the result back to processor #2 by performing the bus operation
memory #1 ⇒ all bus modules: {102, value}
where 102 is the bus address of the processor as supplied in the original READ request message. The processor, which is probably waiting for this result, notices that the bus address lines now contain its own bus address 102. It therefore copies the value from the data lines into its register R1, as the original program instruction requested. It acknowledges receipt of the message, and the memory module releases the bus for use by other modules.
Simple I/O devices, such as keyboards, operate in a similar fashion. At system initialization time, one of the processors SENDs a message to the keyboard controller telling it to SEND all keystrokes to that processor. Each time that the user depresses a key, the keyboard controller SENDs a message to the processor containing as data the name of the key that was depressed. In this case, the processor is probably not waiting for this message, but its bus interface (which is in effect a separate interpreter running concurrently with the processor) notices that a message with its bus address has appeared. The bus interface copies the data from the bus into a temporary register, acknowledges the message, and sends a signal to the processor that will cause the processor to perform an interrupt on its next instruction cycle. The interrupt handler then transfers the data from the temporary register to some place that holds keyboard input, perhaps by SENDing yet another message over the bus to one of the memory modules.
One potential problem of this design is that the interrupt handler must respond and read the keystroke data from the temporary register before the keyboard handler SENDs another keystroke message. Since keyboard typing is slow compared with computer speeds, there is a good chance that the interrupt handler will be there in time to read the data before the next keystroke overwrites it. However, faster devices such as a hard disk might overwrite the temporary register. One solution would be to write a processor program that runs in a tight loop, waiting for data that the disk controller sends over the bus and immediately SENDing that data again over the bus to a memory module.
Some low-end computer designs do exactly that, but a designer can obtain substantially higher performance by upgrading the disk controller to use a technique called direct memory access, or DMA. With this technique, when a processor SENDs a request to a disk controller to READ a block of data from the disk, it includes the address of a buffer in memory as a field of the request message. Then, as data streams in from the disk, the disk controller SENDs it directly to the memory module, incrementing the memory address appropriately between SENDs. In addition to relieving the load on the processor, DMA also reduces the load on the shared bus because it transfers each piece of data across the bus just once (from the disk controller to the memory) rather than twice (first from the disk controller to the processor and then from the processor to the memory). Also, if the bus allows long messages, the DMA controller may be able to take better advantage of that feature than the processor, which is usually designed to SEND and RECEIVE bus data in units that are the same size as its own registers. By SENDing longer messages, the DMA controller increases performance because it amortizes the overhead of the bus arbitration protocol, which it must perform once per message. Finally, DMA allows the processor to execute some other program at the same time that the disk controller is transferring data. Because concurrent operation can hide the latency of the disk transfer, it can provide an additional performance enhancement. The idea of enhancing performance by hiding latency is discussed further in Chapter 6.
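A sketch of the controller's side of a DMA transfer, reusing bus_send from the earlier bus sketch; the 8-byte block size matches the sector-transfer example later in this section:

    def dma_transfer(blocks, start_address, block_size=8):
        address = start_address
        for block in blocks:
            # one bus message per block, sent straight to the memory
            # module with no processor involvement
            bus_send(address, block)
            address += block_size    # increment the address between SENDs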
A convenient interface to I/O and other bus-attached modules is to assign bus addresses to the control registers and buffers of the module. Since each processor maps bus addresses directly into its own memory address space, LOAD and STORE instructions executed in the processor can in effect address the registers and buffers of the I/O module as if they were locations in memory. The technique is known as memory-mapped I/O.
Memory-mapped I/O can be combined with DMA. For example, suppose that a disk controller designed for memory-mapped I/O assigns bus addresses to four of its control registers as follows:
bus address    control register
121            sector_number
122            DMA_start_address
123            DMA_count
124            control
To do disk I/O, the processor uses STORE instructions to SEND appropriate initialization values to the first three disk controller registers and a final STORE instruction to SEND a value that sets a bit in the control register that the disk controller interprets as the signal to start. A program to GET a 256-byte disk sector currently stored at sector number 11742 and transfer the data into memory starting at location 3328 starts by loading four registers with these values and then issuing STOREs of the registers to the appropriate bus addresses:
R1 ← 11742; R2 ← 3328; R3 ← 256; R4 ← 1;
STORE 121,R1 // set sector number
STORE 122,R2 // set memory address register
STORE 123,R3 // set byte count
STORE 124,R4 // start disk controller running
Upon completion of the bus SEND generated by the last STORE instruction, the disk controller, which was previously idle, leaps into action, reads the requested sector from the disk into an internal buffer, and begins using DMA to transfer the contents of the buffer to memory one block at a time. If the bus can handle blocks that are 8 bytes long, the disk controller would SEND a series of bus messages such as
disk controller #1 ⇒ all bus modules: {3328, block[1]}
disk controller #1 ⇒ all bus modules: {3336, block[2]}
etc …
Memory-mapped I/O is a popular interface because it provides a uniform memory-like LOAD and STORE interface to every bus module that implements it. On the other hand, the designer must be cautious in trying to extend the memory-mapped model too far. For example, trying to arrange so the processor can directly address individual bytes or words on a magnetic disk could be problematic in a system with a 32-bit address space because a disk as small as 4 gigabytes would use up the entire address space. More important, the latency of a disk is extremely large compared with the cycle time of a processor. For the STORE instruction to sometimes operate in a few nanoseconds (when the address is in electronic memory) and other times require 10 milliseconds to complete (when the address is on the disk) would be quite unexpected and would make it difficult to write programs that have predictable performance. In addition, it would violate a fundamental rule of human engineering, the principle of least astonishment (see Sidebar 2.5). The bottom line is that the physical properties of the magnetic disk make the DMA access model more appropriate than the memory-mapped I/O model.
Sidebar 2.5 Human Engineering and the Principle of Least Astonishment
An important principle of human engineering for usability, which for computer systems means designing to make them easy to set up, easy to use, easy to program, and easy to maintain, is the principle of least astonishment.
The Principle of Least Astonishment
People are part of the system. The design should match the user’s experience, expectations, and mental models.
Human beings make mental models of the behavior of everything they encounter: components, interfaces, and systems. If the actual component, interface, or system follows that mental model, there is a better chance that it will be used as intended and less chance that misuse or misunderstanding will lead to a mistake or disappointment. Since complexity is relative to understanding, the principle also tends to help reduce complexity.
For this reason, when choosing among design alternatives, it is usually better to choose one that is most likely to match the expectations of those who will have to use, apply, or maintain the system. The principle should also be a factor when evaluating trade-offs. It applies to all aspects of system design, especially to the design of human interfaces and to computer security.
Some corollaries are worth noting: Be consistent. Be predictable. Minimize side-effects. Use names that describe. Do the obvious thing. Provide sensible interpretations for all reasonable inputs. Avoid unnecessary variations.
Some authors prefer the words “principle of least surprise” to “principle of least astonishment”. When Bayesian statisticians invoke the principle of least surprise, they usually mean “choose the most likely explanation”, a version of the closely related Occam’s razor. (See the aphorism at the bottom of page 9.)
Human Engineering and the Original Murphy’s Law. If you ask a group of people “What is Murphy’s law?” most responses will be some variation of “If anything can go wrong, it will”, followed by innumerable equivalents, such as “the toast always falls butter side down”.
In fact, Murphy originally said something quite different. Rather than a comment on the innate perversity of inanimate objects (sometimes known as Finagle’s law, from a science fiction story), Murphy was commenting on a property of human nature that one must take into account when designing complex systems: If you design it so that it can be assembled wrong, someone will assemble it wrong. Murphy was pointing out the wisdom of good human engineering of things that are to be assembled: design them so that the only way to assemble them is the right way.
Edward A. Murphy, Jr., was an engineer working on United States Air Force rocket sled experiments at Edwards Air Force Base in 1949, in which Major John Paul Stapp volunteered to be subjected to extreme decelerations (40 Gs) to determine the limits of human tolerance for ejection seat design. On one of the experiments, someone wired up all of the strain gauges incorrectly, so at the end of Stapp’s (painful) ride there was no usable data. Murphy said, in exasperation at the technician who wired up the strain gauges, “if that guy can find a way to do it wrong, he will.” Stapp, who as a hobby made up laws at every opportunity, christened this observation “Murphy’s law,” and almost immediately began telling it to others in the different and now widely known form “If anything can go wrong, it will.”
A good example of Murphy’s original observation in action showed up in an incident on a Convair 580 cargo plane in 1997. Two identical control cables ran from a cockpit control to the elevator trim tab, a small movable surface on the rear stabilizing wing that, when adjusted up or down, forces the nose of the plane to rise or drop, respectively. Upon take-off on the first flight after maintenance, the pilots found that the plane was pitching nose-up. They tried adjusting the trim tab to maximum nose-down position, but the problem just got worse. With much effort they managed to land the plane safely. When mechanics examined the plane, they discovered that the two cables to the trim tab had been interchanged, so that moving the control up caused the trim tab to go down and vice versa.*
A similar series of incidents in 1988 and 1989 involved crossed connections in cargo area smoke alarm signal wires and fire extinguisher control wires in the Boeing 737, 757, and 767 aircraft.†
* Transportation Safety Board of Canada, Report A97O0077, January 13, 2000, updated October 6, 2002.
† Karen Fitzgerald, “Boeing’s crossed connections”, IEEE Spectrum 26, 5 (May 1989), pages 30–35.
The middle and higher layers of a computer system are usually implemented as software modules. To make this layered organization concrete, consider the file, a high-level version of the memory abstraction. A file holds an array of bits or bytes, the number of which the application chooses. A file has two key properties:
It is durable. Information, once stored, will remain intact through system shutdowns and can be retrieved later, perhaps weeks or months later. Applications use files to durably store documents, payroll data, e-mail messages, programs, and anything else they do not want to be lost.
It has a name. The name of a file allows users and programs to store information in such a way that they can find and use it again at a later time. File names also make it possible for users to share information. One can WRITE a named file and tell a friend the file name, and then the friend can use the name to READ the file.
Taken together, these two features mean that if, for example, Alice creates a new file named “strategic plan”, WRITEs some information in it, shuts down the computer, and the next day turns it on again, she will then be able to READ the file named “strategic plan” and get back its content. Furthermore, she can tell Bob to look at the file named “strategic plan”. When Bob asks the system to READ a file with that name, he will read the file that she created. Most file systems also provide other additional properties for files, such as timestamps to determine when they were created, last modified, or last used, assurances about their durability (a topic that Chapter 10 [on-line] revisits), and the ability to control who may share them (one of the topics of Chapter 11 [on-line]).
The system layer implements files using modules from the hardware layer. Figure 2.18 shows the pseudocode of a simple application that reads input from a keyboard device, writes that input to a file, and also displays it on the display device.
Figure 2.18 Using the file abstraction to implement a display program, which also writes the keyboard input in a file. For clarity, this program ignores the possibility that any of the abstract file primitives may return an error status.
A typical API for the file abstraction contains calls to OPEN a file, to READ and WRITE parts of the file, and to CLOSE the file. The OPEN call translates the file name into a temporary name in a local name space to be used by the READ and WRITE operations. Also, OPEN usually checks whether this user is permitted access to the file. As its last step, OPEN sets a cursor, sometimes called a file pointer, to zero. The cursor records an offset from the beginning of the file to be used as the starting point for READs and WRITEs. Some file system designs provide a separate cursor for READs and WRITEs, in which case OPEN may initialize the WRITE cursor to the number of bytes in the file.
A call to READ delivers to the caller a specified number of bytes from the file, starting from the READ cursor. It also adds to the READ cursor the number of bytes read so that the next READ proceeds where the previous READ left off. If the program asks to read bytes beyond the end of the file, READ returns some kind of end-of-file status indicator.
Similarly, the WRITE operation takes as arguments a buffer with bytes and a length, stores those bytes in the file starting at the offset indicated by the WRITE cursor (if the WRITE cursor starts at or reaches the end of the file, WRITE usually implies extending the size of the file), and adds to the WRITE cursor the number of bytes written so that the next WRITE can continue from there. If there is not enough space on the device to write that many bytes, the WRITE procedure fails by returning some kind of device-full error status or exception.
Finally, when the program is finished reading and writing, it calls the CLOSE procedure. CLOSE frees up any internal state that the file system maintains for the file (for example, the cursors and the record of the temporary file name, which is no longer meaningful). Some file systems also ensure that, when CLOSE returns, all parts of the modified file have been stored durably on a non-volatile memory device. Other file systems perform this operation in the background after CLOSE returns.
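A minimal in-memory Python sketch of this cursor-based interface may help fix the ideas. It is a toy, not the UNIX implementation; all names, and the choice of None as the end-of-file indicator, are our own assumptions.

files = {}            # file name -> bytearray holding the file’s contents
open_files = {}       # file descriptor -> [file name, cursor]
next_fd = 0

def OPEN(name):
    global next_fd
    files.setdefault(name, bytearray())      # create the file if it does not exist
    fd = next_fd
    next_fd += 1
    open_files[fd] = [name, 0]               # as a last step, set the cursor to zero
    return fd

def READ(fd, n):
    name, cursor = open_files[fd]
    data = bytes(files[name][cursor:cursor + n])
    if not data:
        return None                          # an end-of-file status indicator
    open_files[fd][1] = cursor + len(data)   # advance the cursor by the bytes read
    return data

def WRITE(fd, buf):
    name, cursor = open_files[fd]
    files[name][cursor:cursor + len(buf)] = buf   # extends the file at the end
    open_files[fd][1] = cursor + len(buf)
    return len(buf)

def CLOSE(fd):
    del open_files[fd]                       # free the cursor and temporary name

fd = OPEN("strategic plan")
WRITE(fd, b"expand the fleet")
CLOSE(fd)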
The file system module implements the file API by mapping bytes of the file to disk sectors. For each file the file system creates a record of the name of the file and the disk sectors in which it has stored the file. The file system also stores this record on the disk. When the computer restarts, the file system must somehow discover the place where it left these records so that it can again find the files. A typical procedure for name discovery is for the file system to reserve one well-known disk sector, such as sector number 1, and use that well-known disk sector as a toehold to locate the sectors where it left the rest of the file system information. A detailed description of the UNIX file system API and its implementation is in Section 2.5.
One might wonder why the file API supports OPEN and CLOSE in addition to READ and WRITE; after all, one could ask the programmer to pass the file name and a file position offset on each READ and WRITE call. The reason is that the OPEN and CLOSE procedures mark the beginning and the end of a sequence of related READ and WRITE operations so that the file system knows which reads and writes belong together as a group. There are several good reasons for grouping and for the use of a temporary file name within the grouping. Originally, performance and resource management concerns motivated the introduction of OPEN and CLOSE, but later implementations of the interface exploited the existence of OPEN and CLOSE to provide clean semantics under concurrent file access and failures.
Early file systems introduced OPEN to amortize the cost of resolving a file name. A file name is a path name that may contain several components. By resolving the file name once on OPEN and giving the result a simple name, READ and WRITE avoid having to resolve the name on each invocation. Similarly, OPEN amortizes the cost of checking whether the user has the appropriate permissions to use the file.
CLOSE was introduced to simplify resource management: when an application invokes CLOSE, the file system knows that the application doesn’t need the resources (e.g., the cursor) that the file system maintains internally. Even if a second application removes a file before a first application is finished reading and writing the file, the file system can implement READ and WRITE procedures for the first application sensibly (for example, discard the contents of the file only after everyone that OPENed the file has called CLOSE).
More recent file systems use OPEN and CLOSE to mark the beginning and end of an atomic action. The file system can treat all intervening READ and WRITE calls as a single indivisible operation, even in the face of concurrent access to the file or a system crash after some but not all of the WRITEs have completed. Two opportunities ensue:
1. The file system can use the OPEN and CLOSE operations to coordinate concurrent access to a file: if one program has a file open and another program tries to OPEN that same file, the file system can make the second program wait until the first one has CLOSEd the file. This coordination is an example of before-or-after atomicity, a topic that Section 5.2.4 explores in depth.
2. If the file system crashes (for example, because of a power failure) before the application CLOSEs the file, none of the WRITEs will be in the file when the system comes back up. If it crashes after the application CLOSEd the file, all of the WRITEs will be in the file. Not all file systems provide this guarantee, known as all-or-nothing atomicity, since it is not easy to implement correctly and efficiently, as Chapter 9 [on-line] explains.
There is a cost to the OPEN/CLOSE model: the file system must maintain per-client state in the form of the resolved file name and the cursor(s). It is possible to design a completely stateless file interface. An example is the Network File System, described in Section 4.5.
The file is such a convenient memory abstraction that in some systems (for example, the UNIX system and its derivatives) every input/output device in a computer system provides a file interface (see Figure 2.19). In such systems, files not only are an abstraction for non-volatile memories (e.g., magnetic disks), but they are also a convenient interface to the keyboard device, the display, communication links, and so on. In such systems, each I/O device has a name in the file naming scheme. A program OPENs the keyboard device, READs bytes from the keyboard device, and then CLOSEs the keyboard device, without having to know any details about the keyboard management procedure, what type of keyboard it is, and the like. Similarly, to interact with the display, a program can OPEN the display device, WRITE to it, and CLOSE it. The program need not know any details about the display. In accordance with the principle of least astonishment, each device management procedure provides some reasonable interpretation for every file system method. The pseudocode of Figure 2.18 exemplifies the benefit of this kind of design uniformity.
Figure 2.19 Using the file abstraction and layering to integrate different kinds of input and output devices. The file system acts as an intermediary that provides a uniform, abstract interface, and the various device managers are programs that translate that abstract interface into the operational requirements for different devices.
One feature of such a uniform interface is that in many situations one can, by simply rebinding the name, replace an I/O device with a file, or vice versa, without modifying the application program in any way. This use of naming in support of modularity is especially helpful when debugging an application program. For example, one can easily test a program that expects keyboard input by slipping a file filled with text in the place of the keyboard device. Because of such examples, the file system abstraction has proven to be very successful.
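In a UNIX-like system this rebinding can be seen directly. The short Python sketch below is our own, under the assumption that the keyboard (terminal) is reachable through the device file /dev/tty; the same display procedure runs unchanged whether its input name is bound to the device or to an ordinary text file.

import sys

def display(input_name, log_name):
    # The program neither knows nor cares whether input_name names a device.
    with open(input_name, "rb") as source, open(log_name, "wb") as log:
        for line in source:
            log.write(line)                       # write the input to a file
            sys.stdout.write(line.decode())       # and show it on the display

# display("/dev/tty", "typescript")   # read from the keyboard device
# display("testdata", "typescript")   # same program, rebound to a text file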
This chapter has developed several ideas and concepts that provide useful background for the study of computer system design. First, it described the three major abstractions used in designing computer systems—memory, interpreters, and communication links. Then it presented a model of how names are used to glue together modules based on those abstractions to create useful systems. Finally, it described some parts of a typical modern layered computer system in terms of the three major abstractions. With this background, we are now prepared to undertake a series of more in-depth discussions of specific computer system design topics. The first such in-depth discussion, in Chapter 3, is of the several engineering problems surrounding the use of names. Each of the remaining chapters undertakes a similar in-depth discussion of a different system design topic.
Before moving on to those in-depth discussions, the last section of this chapter is a case study of how abstraction, naming, and layers appear in practice. The case study uses those three concepts to describe the UNIX system.
The UNIX family of operating systems can trace its lineage back to the UNIX operating system that was developed by Bell Telephone Laboratories for the Digital Equipment Corporation PDP line of minicomputers in the late 1960s and early 1970s [Suggestions for Further Reading 2.2], and before that to the Multics* operating system in the early 1960s [Suggestions for Further Reading 1.7.5 and 3.1.4]. Today there are many flavors of UNIX systems with complex historical relationships; a few examples include GNU/Linux, versions of GNU/Linux distributed by different organizations (e.g., Red Hat, Ubuntu), Darwin (a UNIX operating system that is part of Apple’s operating system Mac OS X), and several flavors of BSD operating systems. Some of these are directly derived from the early UNIX operating system; others provide similar interfaces but have been implemented from scratch. Some are the result of an effort by a small group of programmers, and others are the result of an effort by many. In the latter case, it is even unclear how to exactly name the operating system because substantial parts come from different teams.† The collective result of all these efforts is that operating systems of the UNIX family run on a wide range of computers, including personal computers, server computers, parallel computers, and embedded computers. Most of the UNIX interface is an official standard,‡ and non-UNIX operating systems often support this standard too. Because the source code of some versions is available to the public, one can easily study the UNIX system.
This case study examines the various ways in which the UNIX file system uses names in its design. In the course of examining how it implements its naming scheme, we will also incidentally get a first-level overview of how the UNIX file system is organized.
A program can create a file with a user-chosen name, read and write the file’s content, and set and get a file’s metadata. Example metadata include the time of last modification, the user ID of the file’s owner, and access permissions for other users. (For a full discussion of metadata see Section 3.1.2.) To organize their files, users can group them in directories with user-chosen names, creating a naming network. Users can also graft a naming network stored on a storage device onto an existing naming network, allowing naming networks for different devices to be incorporated into a single large naming network. To support these operations, the UNIX file system provides the application programming interface (API) shown in Table 2.1.
Table 2.1. UNIX File System Application Programming Interface
| Procedure | Brief description |
| OPEN (name, flags, mode) | Open the file name. If the file does not exist and flags so specifies, create it with permissions mode. Set the file cursor to 0. Return a file descriptor. |
| READ (fd, buf, n) | Read n bytes from the file at the current cursor, and increment the cursor by the number of bytes read. |
| WRITE (fd, buf, n) | Write n bytes at the current cursor, and increment the cursor by the number of bytes written. |
| SEEK (fd, offset, whence) | Set the cursor to offset bytes from the beginning, the end, or the current position. |
| CLOSE (fd) | Delete the file descriptor. If this is the last reference to the file, delete the file. |
| FSYNC (fd) | Make all changes to the file durable. |
| STAT (name) | Read the metadata of the file. |
| CHMOD, CHOWN, etc. | Various procedures to set specific metadata. |
| RENAME (from_name, to_name) | Change the name from_name to to_name. |
| LINK (name, link_name) | Create a hard link link_name to the file name. |
| UNLINK (name) | Remove name from its directory. If name is the last name for a file, delete the file. |
| SYMLINK (name, link_name) | Create a symbolic name link_name for the file name. |
| MKDIR (name) | Create a new directory named name. |
| CHDIR (name) | Change the current working directory to name. |
| CHROOT (name) | Change the default root directory to name. |
| MOUNT (name, device) | Graft the file system on device onto the name space at name. |
| UNMOUNT (name) | Unmount the file system at name. |
To tackle the problem of implementing this API, the UNIX file system employs a divide-and-conquer strategy. The UNIX file system makes use of several hidden layers of machine-oriented names (that is, addresses), one on top of another, to implement files. It then applies the UNIX durable object naming scheme to map user-friendly names to these files. Table 2.2 illustrates this structure.
Table 2.2. The naming layers of the UNIX file system.
| Layer | Purpose | |
| Symbolic link layer | Integrates multiple file systems with symbolic links. | ↑ user-oriented names ↓ |
| Absolute path name layer | Provides a root for the naming hierarchies. | |
| Path name layer | Organizes files into naming hierarchies. | |
| File name layer | Provides human-oriented names for files. | machine–user interface |
| Inode number layer | Provides machine-oriented names for files. | ↑ machine-oriented names ↓ |
| File layer | Organizes blocks into files. | |
| Block layer | Identifies disk blocks. | |
In the rest of this section we work our way up from the bottom layer of Table 2.2 to the top layer, proceeding from the lowest layer of the system up toward the user. This description corresponds closely to the implementation of Version 6 of the UNIX system, which dates back to the early 1970s. Version 6 is well documented [Suggestions for Further Reading 2.2.2] and captures the important ideas that are found in many modern UNIX file systems, but modern versions are more complex; they provide better robustness and handle large files, many files, and so on, more efficiently. In a few places we will point out some of these differences, but the reader is encouraged to consult papers in the file system literature to find out how modern UNIX file systems work and are evolving.
At the bottom layer the UNIX file system names some physical device such as a magnetic disk, flash disk, or magnetic tape that can store data durably. The storage on such a device is divided into fixed-size units, called blocks. For a magnetic disk (see Sidebar 2.2), a block corresponds to a small number of disk sectors. A block is the smallest allocation unit of disk space, and its size is a trade-off between several goals. A small block reduces the amount of disk space wasted for small files; if many files are smaller than 4 kilobytes, a 16-kilobyte block size wastes space. On the other hand, a very small block size may incur large data structures to keep track of free and allocated blocks. In addition, there are performance considerations that impact the block size, some of which we discuss in Chapter 6. In Version 6, the UNIX file system used 512-byte blocks, but modern UNIX file systems often use 8-kilobyte blocks.
The names of these blocks are numbers, which typically correspond to the offset of the block from the beginning of the device. In the bottom naming layer, a storage device can be viewed as a context that binds block numbers to physical blocks. The name-mapping algorithm for a block device is simple: it takes as input a block number and returns the block. Actually, we don’t really want the block itself—that would be a pile of iron oxide. What we want is the contents of the block, so the algorithm actually implements a fixed mapping between block name and block contents. If we represent the storage device as a linear array of blocks, then the following code fragment implements the name-mapping algorithm:
procedure BLOCK_NUMBER_TO_BLOCK (integer b) returns block
return device[b]
In this simple algorithm the variable name device refers to some particular physical device. In many devices the mapping is more complicated. For example, a hard drive might keep a set of spare blocks at the end and rebind the block numbers of any blocks that go bad to spares. The hard drive may itself be implemented in layers, as will be seen in Section 8.5.4 [on-line]. The value returned by BLOCK_NUMBER_TO_BLOCK is the contents of block b.
Name discovery: The names of blocks are integers from a compact set, but the block layer must keep track of which blocks are in use and which are available for assignment. As we will see, the file system in general has a need for a description of the layout of the file system on disk. As an anchor for this information, the UNIX file system starts with a super block, which has a well-known name (e.g., 1). The super block contains, for example, the size of the file system’s disk in blocks. (Block 0 typically stores a small program that starts the operating system; see Sidebar 5.3.)
Different implementations of the UNIX file system use different representations for the list of free blocks. The Version 6 implementation keeps a list of block numbers of unused blocks in a linked list that is stored in some of the unused blocks. The block number of the first block of this list is stored in the super block. A call to allocate a block leads to a procedure in the block layer that searches the list for a free block, removes it from the list, and returns that block’s block number.
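The following Python toy sketches that representation: the free list is threaded through the free blocks themselves, each free block storing the number of the next one, with the head of the list kept in the super block. The tiny 32-block disk and the 2-byte block numbers are assumptions for illustration.

BLOCKSIZE = 512
disk = [bytearray(BLOCKSIZE) for _ in range(32)]   # a tiny “disk”
super_block = {"free_list_head": 2}                # block 2 heads the free list

# Link blocks 2 through 31 into the initial free list (0 terminates it).
for b in range(2, 31):
    disk[b][0:2] = (b + 1).to_bytes(2, "little")

def allocate_block():
    b = super_block["free_list_head"]
    if b == 0:
        raise OSError("device full")
    # The first bytes of a free block hold the number of the next free block.
    super_block["free_list_head"] = int.from_bytes(disk[b][0:2], "little")
    return b

def free_block(b):
    disk[b][0:2] = super_block["free_list_head"].to_bytes(2, "little")
    super_block["free_list_head"] = b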
Modern UNIX file systems often use a bitmap for keeping track of free blocks. Bit i in the bitmap records whether block i is free or allocated. The bitmap itself is stored at a well-known location on the disk (e.g., right after the super block). Figure 2.20 shows a possible disk layout for a simple file system. It starts with the super block, followed by a bitmap that records which disk blocks are in use. After the bitmap comes the inode table, which has one entry for each file (as explained next), followed by blocks that are either free or allocated to some file. The super block contains the size of the bitmap and inode table in blocks.
Figure 2.20 Possible disk layout for a simple file system.
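A bitmap allocator is simpler still. In the sketch below the bitmap is held as a Python list rather than as packed bits stored after the super block, but the essence is the same: allocation scans for a free bit, marks it allocated, and returns the corresponding block number.

NBLOCKS = 32
bitmap = [False] * NBLOCKS          # bit i records whether block i is allocated

def allocate_block():
    for i, allocated in enumerate(bitmap):
        if not allocated:
            bitmap[i] = True
            return i                # a block’s name is its number
    raise OSError("device full")

def free_block(i):
    bitmap[i] = False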
Users need to store items that are larger than one block in size and that may grow or shrink over time. To support such items, the UNIX file system introduces a next naming layer for files. A file is a linear array of bytes of arbitrary length. The file system needs to record in some way which blocks belong to each file. To support this requirement, the UNIX file system creates an index node, or inode for short, as a container for metadata about the file. Our initial declaration of an inode is:
structure inode
integer block_numbers[N] // the numbers of the blocks that constitute the file
integer size // the size of the file in bytes
The inode for a file is thus a context in which the various blocks of the file are named by integer block numbers. With this structure, a simplified name-mapping algorithm for resolving the name of a block in a file is as follows:
procedure INDEX_TO_BLOCK_NUMBER (inode instance i, integer index) returns integer
return i.block_numbers[index]
The Version 6 UNIX file system uses this algorithm for small files, which are limited to N = 8 blocks. For large files, Version 6 uses a more sophisticated algorithm for mapping the index-th block of an inode to a block number. The first seven entries in i.block_numbers are indirect blocks. Indirect blocks do not contain data, but block numbers. For example, with a block size of 512 bytes and block numbers of 2 bytes (as in Version 6), an indirect block can contain 256 2-byte block numbers. The eighth entry is a doubly indirect block (a block that contains block numbers of indirect blocks). This design with indirect and doubly indirect blocks allows for (N − 1) × 256 + 1 × 256 × 256 = 67,328 blocks when N = 8, about 32 megabytes.* Problem set 1 explores some design trade-offs to allow the file system to support large files. Some modern UNIX file systems use different representations or more sophisticated data structures, such as B+ trees, to implement files.
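The arithmetic is easy to check in Python. With 512-byte blocks and 2-byte block numbers, each indirect block holds 256 block numbers:

pointers_per_block = 512 // 2                  # 256 block numbers per indirect block
blocks = (8 - 1) * pointers_per_block + pointers_per_block ** 2
print(blocks)                                  # 67,328 blocks
print(blocks * 512)                            # 34,471,936 bytes, about 32 megabytes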
The UNIX file system allows users to name any particular byte in a file by layering the previous two naming schemes and specifying the byte number as an offset from the beginning of the file:
1 procedure INODE_TO_BLOCK (integer offset, inode instance i) returns block
2 o ← offset / BLOCKSIZE
3 b ← INDEX_TO_BLOCK_NUMBER (i, o)
4 return BLOCK_NUMBER_TO_BLOCK (b)
The value returned is the entire block that holds the byte at offset. Version 6 used a 3-byte number for offset, which limits the maximum file size to 2^24 bytes. Modern UNIX file systems use a 64-bit number. The procedure returns the entire block that contains the named byte. As we will see in Section 2.5.11, READ uses this procedure to return the requested bytes.
Instead of passing inodes themselves around, it would be more convenient to name them and pass their names around. To support this feature, the UNIX file system provides another naming layer that names inodes by an inode number. A convenient way to implement this naming layer is to employ a table that directly contains all inodes, indexed by inode number. Here is the naming algorithm:
1 procedure INODE_NUMBER_TO_INODE (integer inode_number) returns inode
2 return inode_table[inode_number]
where inode_table is an object that is stored at a fixed location on the storage device (e.g., at the beginning). The name-mapping algorithm for inode_table just returns the starting block number of the table.
Name discovery: inode numbers, like disk block numbers, are a compact set of integers, and again the inode number layer must keep track of which inode numbers are in use and which are free to be assigned. As with block number assignment, different implementations use various representations for a list of free inodes and provide calls to allocate and deallocate inodes. In the simplest implementation, the inode contains a field recording whether or not it is free.
By putting these three layers together, we obtain the following procedure:
1 procedure INODE_NUMBER_TO_BLOCK (integer offset, integer inode_number)
2 returns block
3 inode instance i ← INODE_NUMBER_TO_INODE (inode_number)
4 o ← offset / BLOCKSIZE
5 b ← INDEX_TO_BLOCK_NUMBER (i, o)
6 return BLOCK_NUMBER_TO_BLOCK (b)
This procedure returns the block that contains the byte at offset in the file named by inode_number. This procedure traverses three layers of naming. There are numbers for storage blocks, numbered indexes for blocks belonging to an inode, and numbers for inodes.
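The three layers can be exercised together in a small Python model. This is an illustrative toy (an in-memory “disk” and a dict for the inode table, with lowercased names mirroring the pseudocode), not the real representation:

BLOCKSIZE = 512
device = [bytes(BLOCKSIZE) for _ in range(64)]      # block layer: number -> block
inode_table = {}                                    # inode number layer

class Inode:
    def __init__(self, block_numbers, size):
        self.block_numbers = block_numbers          # file layer
        self.size = size

def inode_number_to_block(offset, inode_number):
    i = inode_table[inode_number]                   # INODE_NUMBER_TO_INODE
    o = offset // BLOCKSIZE
    b = i.block_numbers[o]                          # INDEX_TO_BLOCK_NUMBER
    return device[b]                                # BLOCK_NUMBER_TO_BLOCK

# Example: inode 7 describes a 600-byte file stored in blocks 12 and 5.
inode_table[7] = Inode([12, 5], 600)
assert inode_number_to_block(520, 7) is device[5]   # byte 520 lives in block 5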
Numbers are convenient names for use by a computer (numbers can be stored in fixed-length fields that simplify storage allocation) but are inconvenient names for use by people (numbers have little mnemonic value). In addition, block and inode numbers specify a location, so if it becomes necessary to rearrange the physical storage, the numbers must change, which is again inconvenient for people. The UNIX file system deals with this problem by inserting a naming layer whose sole purpose is to hide the metadata of file management. Above this layer is a user-friendly naming scheme for durable objects—files and input/output devices. This naming scheme again has several layers. The most visible component of the durable object naming scheme is the directory. In the UNIX file system, a directory is a context containing a set of bindings between character-string names and inode numbers.
To create a file, the UNIX file system allocates an inode, initializes its metadata, and binds the proposed name to that inode in some directory. As the file is written, the file system allocates blocks to the inode.
By default, the UNIX file system adds the file to the current working directory. The current working directory is a context reference to the directory in which the active application is working. The form of the context reference is just another inode number. If wd is the name of the state variable that contains the working directory for a running program (called a process in the UNIX system), one can look up the inode number of the just-created file by supplying wd as the second argument to a procedure such as:
procedure NAME_TO_INODE_NUMBER (character string filename, integer dir) returns integer
return LOOKUP (filename, dir)
The procedure CHDIR, whose implementation we describe later, allows a process to set wd.
To represent a directory, the UNIX file system reuses the mechanisms developed so far: it represents directories as files. By convention, a file that represents a directory contains a table that maps file names to inode numbers. For example, Figure 2.21 is a directory with two file names (“program” and “paper”), which are mapped to inode numbers 10 and 12, respectively. In Version 6, the maximum length of a name is 14 bytes, and the entries in the table have a fixed length of 16 bytes (14 for the name and 2 for the inode number). Modern UNIX file systems allow for variable-length names, and the table representation is more sophisticated.
Figure 2.21 A directory.
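To make the fixed-length layout concrete, the Python sketch below decodes a Version 6-style directory held in a byte string: each 16-byte entry is a 2-byte inode number followed by a 14-byte name padded with zero bytes. The little-endian encoding and the use of inode number 0 to mark an unused entry are assumptions for the sake of illustration.

import struct

def parse_directory(data):
    """Return a dict mapping file names to inode numbers."""
    bindings = {}
    for offset in range(0, len(data), 16):
        inode_number, raw_name = struct.unpack("<H14s", data[offset:offset + 16])
        if inode_number != 0:                       # 0 marks an unused entry
            bindings[raw_name.rstrip(b"\x00").decode()] = inode_number
    return bindings

entries = struct.pack("<H14s", 10, b"program") + struct.pack("<H14s", 12, b"paper")
print(parse_directory(entries))                     # {'program': 10, 'paper': 12}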
To record whether an inode is for a directory or a file, the UNIX file system extends the inode with a type field:
structure inode
integer block_numbers[N] // the numbers of the blocks that constitute the file
integer size // the size of the file in bytes
integer type // type of file: regular file, directory, …
MKDIR creates a zero-length file (directory) and sets type to DIRECTORY. Extensions introduced later will add additional values for type.
With this representation of directories and inodes, LOOKUP is as follows:
1 procedure LOOKUP (character string filename, integer dir) returns integer
2 block instance b
3 inode instance i ← INODE_NUMBER_TO_INODE (dir)
4 if i.type ≠ DIRECTORY then return FAILURE
5 for offset from 0 to i.size – 1 do
6 b ← INODE_NUMBER_TO_BLOCK (offset, dir)
7 if STRING_MATCH (filename, b) then
8 return INODE_NUMBER (filename, b)
9 offset ← offset + BLOCKSIZE
10 return FAILURE
LOOKUP reads the blocks that contain the data for the directory dir and searches for the string filename in the directory’s data. It computes the block number for the first block of the directory (line 6) and the procedure STRING_MATCH (no code shown) searches that block for an entry for the name filename. If there is an entry, INODE_NUMBER (no code shown) returns the inode number in the entry (line 8). If there is no entry, LOOKUP computes the block number for the second block, and so on, until all blocks of the directory have been searched. If none of the blocks contain an entry for filename, LOOKUP returns an error (line 10). As an example, an invocation of LOOKUP (“program”, dir), where dir is the inode number for the directory of Figure 2.21, would return the inode number 10.
Having all files in a single directory makes it hard for users to keep track of large numbers of files. Enumerating the contents of a large directory would generate a long list that is organized simply (e.g., alphabetically) at best. To allow arbitrary groupings of user files, the UNIX file system permits users to create named directories.
A directory can be named just like a file, but the user also needs a way of naming the files in that directory. The solution is to add some structure to file names: for example, “projects/paper”, in which “projects” names a directory and “paper” names a file in that directory. Structured names such as these are examples of path names. The UNIX file system uses a virgule (forward slash) as a separator of the components of a path name; other systems choose different separator characters such as period, back slash, or colon. With these tools, users can create a hierarchy of directories and files.
The name-resolving algorithm for path names can be implemented by layering a recursive procedure over the previous directory lookup procedure:
1 procedure PATH_TO_INODE_NUMBER (character string path, integer dir) returns integer
2 if (PLAIN_NAME (path)) return NAME_TO_INODE_NUMBER (path, dir)
3 else
4 dir ← LOOKUP (FIRST (path), dir)
5 path ← REST (path)
6 return PATH_TO_INODE_NUMBER (path, dir)
The function PLAIN_NAME (path) scans its argument for the UNIX standard path name separator (forward slash) and returns TRUE if it does not find one. If there is no separator, the program resolves the simple name to an inode number in the requested directory (line 2). If there is a separator in path, the program takes it to be a path name and goes to work on it (lines 4 through 6). The function FIRST peels off the first component name from the path, and REST returns the remainder of the path name. Thus, for example, the call PATH_TO_INODE_NUMBER (“projects/paper”, wd) results in the recursive call PATH_TO_INODE_NUMBER (“paper”, dir), where dir is the inode number for the directory “projects”.
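A runnable Python counterpart of this recursion may be useful. It is a sketch in which FIRST and REST are realized with str.partition, and directories are modeled as dicts mapping names to inode numbers (an assumption of this toy, not the on-disk form).

def path_to_inode_number(path, dir, lookup):
    """Resolve path starting from the directory with inode number dir;
    lookup(name, dir) resolves a single component, like LOOKUP above."""
    first, separator, rest = path.partition("/")
    if separator == "":                       # a plain name: resolve it directly
        return lookup(path, dir)
    return path_to_inode_number(rest, lookup(first, dir), lookup)

directories = {1: {"projects": 2}, 2: {"paper": 17}}
lookup = lambda name, dir: directories[dir][name]
assert path_to_inode_number("projects/paper", 1, lookup) == 17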
With path names, one often has to type names with many components. To address this annoyance, the UNIX file system supports a change directory procedure, CHDIR, allowing a process to set its working directory:
procedure CHDIR (character string path)
wd ← PATH_TO_INODE_NUMBER (path, wd)
When a process starts, it inherits the working directory from the parent process that created this process.
To refer to files in directories other than the current working directory still requires typing long names. For example, while we are working in the directory “projects”—after calling CHDIR (“projects”)—we might have to refer often to the file “Mail/inbox/new-assignment”. To address this annoyance, the UNIX file system supports synonyms known as links. In the example, we might want to create a link for this file in the current working directory, “projects”. Invoking the LINK procedure with the following arguments:
LINK (“Mail/inbox/new-assignment”, “assignment”)
makes “assignment” a synonym for “Mail/inbox/new-assignment” in “projects”, if “assignment” doesn’t exist yet. (If it does, LINK will return an error saying “assignment” already exists.) With links, the directory hierarchy turns from a strict hierarchy into a directed graph. (The UNIX file system allows links only to files, not to directories, so the graph is not only directed but acyclic. We will see why in a moment.)
The UNIX file system implements links simply as bindings in different contexts that map different file names to the same inode number; thus, links don’t require any extension to the naming scheme developed so far. For example, if the inode number for “new-assignment” is 481, then the directory “Mail/inbox” contains an entry {“new-assignment”, 481} and after the above command is executed the directory “projects” contains an entry {“assignment”, 481}. In UNIX system jargon, “projects/assignment” is now linked to “Mail/inbox/new-assignment”.
When a file is no longer needed, a process can remove a file using UNLINK (filename), indicating to the file system that the name filename is no longer in use. UNLINK removes the binding of filename to its inode number from the directory that contains filename. The file system also puts filename’s inode and the blocks of filename’s inode on the free list if this binding is the last one containing the inode’s number.
Before we added links, a file was bound to a name in only one directory, so if a process asks to delete the name from that directory, the file system can also delete the file. But now that links have been added, when a process asks to delete a name, there may still be names in other directories bound to the file, in which case the file shouldn’t be deleted. This raises the question, when should a file be deleted? The UNIX file system deletes a file when a process removes the last binding for a file. The UNIX file system implements this policy by keeping a reference count in the inode:
structure inode
integer block_numbers[N]
integer size
integer type
integer refcnt
Whenever it makes a binding to an inode, the file system increases the reference count of that inode. To delete a file, the UNIX file system provides an UNLINK(filename) procedure, which deletes the binding specified by filename. At the same time the file system decreases the reference count in the corresponding inode by one. If the decrease causes the reference count to go to zero, that means there are no more bindings to this inode, so the file system can free the inode and its corresponding blocks. For example, UNLINK (“Mail/inbox/new-assignment”) removes the directory entry “new-assignment” in the directory “Mail/inbox”, but not “assignment”, because after the unlink the refcnt in inode 481 will be 1. Only after calling UNLINK (“assignment”) will the inode 481 and its blocks be freed.
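The policy is easy to model. In the Python toy below (ours; directories are dicts and inodes are dicts with a refcnt field), every binding increments the count, unlinking decrements it, and the inode is freed when the count reaches zero:

inodes = {481: {"refcnt": 0, "blocks": [3, 9]}}
directories = {"Mail/inbox": {}, "projects": {}}

def bind(directory, name, inode_number):
    directories[directory][name] = inode_number
    inodes[inode_number]["refcnt"] += 1

def unlink(directory, name):
    inode_number = directories[directory].pop(name)
    inodes[inode_number]["refcnt"] -= 1
    if inodes[inode_number]["refcnt"] == 0:
        del inodes[inode_number]              # free the inode and its blocks

bind("Mail/inbox", "new-assignment", 481)
bind("projects", "assignment", 481)           # LINK: a second binding, refcnt = 2
unlink("Mail/inbox", "new-assignment")        # refcnt drops to 1; the file survives
unlink("projects", "assignment")              # refcnt reaches 0: inode 481 is freed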
Using reference counts works only if there are no cycles in the naming graph. To ensure that the UNIX naming network is a directed graph without cycles, the UNIX file system forbids links to directories. To see why cycles are avoided, consider a directory “a”, which contains a directory “b”. If a program invokes LINK (“a/b/c”, “a”) in the directory that contains “a”, then the system would return an error and not perform the operation. If the system had performed this operation, it would have created a cycle from “c” to “a” and would have increased the reference count in the inode of “a” by one. If a program then invokes UNLINK (“a”), the name “a” is removed, but the inode and the blocks of “a” wouldn’t be removed because the reference count in the inode of “a” is still positive (because of the link from “c” to “a”). But once the name “a” would be removed, a user would no longer be able to name the directory “a” and wouldn’t be able to remove it either. In that case, the directory “a” and its subdirectories would be disconnected from the naming graph, but the system would not remove it because the reference count in the inode of “a” is still positive. It is possible to detect this situation, for example by using garbage collection, but it is expensive to do so. Instead, the designers chose a simpler solution: don’t allow links to directories, which rules out the possibility of cycles.
There are two special cases, however. First, by default each directory contains a link to itself; the UNIX file system reserves the string “.” (a single dot) for this purpose. The name “.” thus allows a process to name the current directory without knowing which directory it is. When a directory is created, the directory’s inode has a reference count of two: one for the inode of the directory and one for the link “.”, because it points to itself. Because “.” introduces a cycle of length 0, there is no risk that part of the naming network will become disconnected when removing a directory. When unlinking a directory, the file system just decreases the reference count of the directory’s inode by 2.
Second, by default, each directory also contains a link to a parent directory; the file system reserves the string “..” (two consecutive dots) for this purpose. The name “..” allows a process to name a parent directory and, for example, move up the file hierarchy by invoking CHDIR (“..”). The link doesn’t create problems. Only when a directory has no other entries than “.” and “..” can it be removed. If a user wants to remove a directory “a”, which contains a directory “b”, then the file system refuses to do so until the user first has removed “b”. This rule ensures that the naming network cannot become disconnected.
Using LINK and UNLINK, Version 6 implemented RENAME (from_name, to_name) as follows:
1 UNLINK (to_name)
2 LINK (from_name, to_name)
3 UNLINK (from_name)
This implementation, however, has an undesirable property. Programs often use RENAME to change a working copy of a file into the official version; for example, a user may be editing a file “x”. The text editor actually makes all changes to a temporary file “#x”. When the user saves the file, the editor renames the temporary file “#x” to “x”.
The problem with implementing RENAME using LINK and UNLINK is that if the computer fails between steps 1 and 2 and then restarts, the name to_name (“x” in this case) will be lost, which is likely to surprise the user, who is unlikely to know that the file still exists but under the name “#x”. What is really needed is that “#x” be renamed to “x” in a single, atomic operation, but that requires atomic actions, which are the topic of Chapter 9 [on-line].
Without atomic actions, it is possible to implement the following slightly weaker specification for RENAME: if to_name already exists, an instance of to_name will always exist, even if the system should fail in the middle of RENAME. This specification is good enough for the editor to do the right thing and is what modern versions provide.
Modern versions implement this specification in essence as follows:
1 LINK (from_name, to_name)
2 UNLINK (from_name)
Because one cannot link to a name that already exists, RENAME implements the effects of these two calls by manipulating the file system structures directly. RENAME first changes the inode number in the directory entry for to_name to the inode number for from_name on disk. Then, RENAME removes the directory entry for from_name. If the file system fails between these two steps, then on recovery the file system must increase the reference count in from_name’s inode because both from_name and to_name are pointing to the inode. This implementation ensures that if to_name exists before the call to RENAME, it will continue to exist, even if the computer fails during RENAME.
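Continuing the toy model of the previous sketch, a Python version of this direct manipulation might look as follows; the recovery step mentioned above is represented only by a comment, since this sketch has no real crashes:

def rename(directory, from_name, to_name):
    entries = directories[directory]
    inode_number = entries[from_name]
    displaced = entries.get(to_name)          # the inode to_name names now, if any
    entries[to_name] = inode_number           # step 1: repoint to_name’s entry
    # A crash here leaves both names bound to the inode; on recovery the
    # file system must increase the inode’s refcnt to match.
    del entries[from_name]                    # step 2: remove from_name’s entry
    if displaced is not None:
        inodes[displaced]["refcnt"] -= 1      # the displaced inode loses a binding
        if inodes[displaced]["refcnt"] == 0:
            del inodes[displaced]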
The UNIX system provides each user with a personal directory, called a user’s home directory. When a user logs on to a UNIX system, it starts a command interpreter (known as the shell) through which a user can interact with the system. The shell starts with the working directory (wd) set to the inode number of the user’s home directory. With the above procedures, users can create personal directory trees to organize the files in their home directory.
But having several personal directory trees does not allow one user to share files with another. To do that, one user needs a way of referring to the names of files that belong to another user. The easiest way to accomplish that is to bind a name for each user to that user’s top-level directory, in some context that is available to every user. But then there is a requirement to name this systemwide context. Typically, there are needs for other systemwide contexts, such as a directory containing shared program libraries. To address these needs with a minimum of additional mechanisms, the file system provides a universal context, known as the root directory. The root directory contains bindings for the directory of users, the directory containing program libraries, and any other widely shared directories. The result is that all files of the system are integrated into a single directory tree (with restricted cross-links) based on the root.
This design leaves a name discovery question: how can a user name the root directory? Recall that name lookup requires a context reference—the name of a directory inode—and until now that directory inode has been supplied by the working directory state variable. To implement the root, the file system simply declares inode number 1 to be the inode for the root directory. This well-known name can then be used by any user as the starting context in which to look up the name of a shared context, or another user (or even to look up one’s own name, to set the working directory when logging in).
The file system actually provides two ways to refer to things in the root directory. Starting from any directory in the system, one can use the name “..” to name that directory’s parent, “../..” to name the directory above that, and so on until the root directory is reached. A user can tell that the root directory is reached, because “..” in the root directory names the root directory. That is, in the root directory, both “.” and “..” are links to the root directory. The other way is with absolute path names, which in the UNIX file system are names that start with a “/”, for example, “/Alice/Mail/inbox/new-assignment”.
To support absolute path names as well as relative path names, we need one more layer in the naming scheme:
1 procedure GENERALPATH_TO_INODE_NUMBER (character string path) returns integer
2 if (path[0] = “/”) return PATH_TO_INODE_NUMBER(path, 1)
3 else return PATH_TO_INODE_NUMBER(path, wd)
At this point we have completed a naming scheme that allows us to name and share durable storage on a single disk. For example, to find the blocks corresponding to the file “/programs/pong.c” with the information in Figure 2.22, we start by finding the inode table, which starts at a block number (block 4 in our example) stored in the super block (not shown in this figure, but see Figure 2.20). From there we locate the root inode (which is known to be inode number 1). The root inode contains the block numbers that in turn contain the blocks of the root directory; in the figure the root starts in block number 14. Block 14 lists the entries in the root directory: “programs” is named by inode number 7. The inode table says that data for inode number 7 starts in block number 23, which contains the contents of the “programs” directory. The file “pong.c” is named by inode number 9. Referring once more to the inode table, to see where inode 9 is stored, we see that the data corresponding to inode 9 starts in block number 61. In short, directories and files are carefully laid out so that all information can be found by starting from the well-known location of the root inode.
Figure 2.22 Example disk layout for a UNIX file system, refining Figure 2.20 by focusing on the inode table and data blocks. The inode table is a group of contiguous blocks starting at a well-known address, found in the super block (not shown). In this example, blocks 4, 5, and 6 contain the inode table, while blocks 7–61 contain directories and files. The root inode is by convention the well-known inode #1. Typically, inodes are smaller than a block, so in this example there are four inodes in each block. Blocks #14, #37, and #16 constitute the root directory, while block #23 is the first of four blocks of the directory named “/programs”, and block #61 is the first block of the three-block file “/programs/pong.c”.
The default root directory in Version 6 is inode 1. Version 7 added a call, CHROOT, to change the root directory for a process. For example, a Web server can be run in the corner of the UNIX name space by changing its root directory to, for example, “/tmp”. After this call, the root directory for the Web server corresponds to the inode number of the directory “/tmp” and “..” in “/tmp” is a link to “/tmp”. Thus, the server can name only directories and files below “/tmp”.
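A minimal sketch in C of this confinement, using the chroot(2) call that descends from CHROOT; the path “/tmp” follows the example above, the call requires privilege, and a real server would also drop that privilege afterward:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (chroot("/tmp") < 0) {  /* make /tmp the root directory for this process */
        perror("chroot");
        return 1;
    }
    chdir("/");                /* move the working directory inside the new root */
    /* From here on, "/" and ".." resolve no higher than the old /tmp. */
    return 0;
}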
To allow users to name files on other disks, the UNIX file system supports an operation to attach new disks to the name space. A user can choose the name under which each device is attached: for example, the procedure
MOUNT (“/dev/fd1”, “/flash”)
grafts the directory tree stored on the physical device named “/dev/fd1” onto the directory “/flash”. (This command demonstrates that each device also has a name in the same object name space we have been describing; the file corresponding to a device typically contains information about the device itself.) Typically mounts do not survive a shutdown: after a reboot, the user has to explicitly remount the devices. It is interesting to contrast the elegant UNIX approach with the DOS approach, in which devices are named by fixed one-character names (e.g., “C:”).
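For comparison, here is a sketch of the corresponding call in a modern Linux descendant, mount(2); unlike the call above, it takes an explicit file system type, so the “ext4” argument here is an assumption rather than something the text specifies:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Graft the tree stored on /dev/fd1 onto the directory /flash. */
    if (mount("/dev/fd1", "/flash", "ext4", 0, NULL) < 0) {
        perror("mount");   /* requires privilege */
        return 1;
    }
    return 0;
}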
The UNIX file system implements MOUNT by recording in the in-memory inode for “flash” that a file system has been mounted on it and keeps this inode in memory until at least the corresponding UNMOUNT. In memory, the system also records the device and the root inode number of the file system that has been mounted on it. In addition, it records in the in-memory version of the inode for “/dev/fd1” what its parent inode is.
The information for mount points is all recorded in volatile memory instead of on disk and doesn’t survive a computer failure. After a failure, the system administrator or a program must invoke MOUNT again. Supporting MOUNT also requires a change to the file name layer: if LOOKUP runs into an inode on which a file system is mounted, it uses the root inode of the mounted file system for the lookup.
UNMOUNT undoes the mount.
With mounted file systems, synonyms become a more difficult problem because each mounted file system has its own address space of inode numbers. Every inode number has a default context: the disk on which it is located. Thus, there is no way for a directory entry on one disk to bind to an inode number on a different disk. This problem can be approached in several ways, two of which are: (1) make inode numbers unique across all disks or (2) create synonyms for files on other disks in a different way. The UNIX system chooses the second approach by using indirect names called symbolic or soft links, which bind a file name to another file name. Most systems use method (2) because of the complications that would be involved in trying to keep inode numbers universally unique, small in size, and fast to resolve.
Using the procedure SYMLINK, users can create synonyms for files in the same file system or for files in mounted file systems. The file system implements the procedure SYMLINK by allowing the type field of an inode to have the value SYMLINK; the type field tells whether the blocks associated with the inode contain data or a path name:
structure inode
integer block_numbers[N]
integer size
integer type // Type of inode: regular file, directory, symbolic link, …
integer refcnt
If the type field has the value SYMLINK, then the array block_numbers actually contains the characters of a path name rather than a set of block numbers.
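A short C sketch of the interface this type gives rise to: symlink(2) stores a path name, and readlink(2) reads the stored characters back without resolving them. The link and target names follow the text’s examples, and the target need not exist when the link is created:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    /* Store a path name, not an inode number. */
    symlink("Mail/inbox/new-assignment", "assignment");
    ssize_t n = readlink("assignment", buf, sizeof buf - 1);
    if (n < 0) {
        perror("readlink");
        return 1;
    }
    buf[n] = '\0';
    printf("the link contains: %s\n", buf);
    return 0;
}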
Soft links can be implemented by layering them over GENERALPATH_TO_INODE_NUMBER:
1 procedure PATHNAME_TO_INODE (character string filename) returns inode
2 inode instance i
3 inode_number ← GENERALPATH_TO_INODE_NUMBER (filename)
4 i ← INODE_NUMBER_TO_INODE (inode_number)
5 if i.type = SYMLINK then
6 i ← INODE_NUMBER_TO_INODE (GENERALPATH_TO_INODE_NUMBER (COERCE_TO_STRING (i.block_numbers)))
7 return i
The value returned by PATHNAME_TO_INODE is the contents of the inode for the file named by filename. The procedure first looks up the inode number for filename. Then, it looks up the inode using INODE_NUMBER_TO_INODE. If the inode indicates that this file is a symbolic link, the procedure interprets the data of the file as a path name and invokes GENERALPATH_TO_INODE_NUMBER again.
We now have two types of synonyms. A direct binding to an inode number is called a hard link, to distinguish it from a soft link. Continuing an earlier example, a soft link to “Mail/inbox/new-assignment” would contain the string “Mail/inbox/new-assignment”, rather than the inode number 481. A soft link is an example of an indirect name: it binds a name to another name in the same name space, whereas a hard link binds a name to an inode number, which is a name in a lower-layer name space. As a result, the soft link depends on the file name “Mail/inbox/new-assignment”: if the user changes the file’s name or deletes the file, then “projects/assignment”, the link, will end up as a dangling reference (Section 3.1.6 discusses dangling references). But because it links by name rather than by inode number, a soft link can point to a file on a different disk.
Recall that the UNIX system forbids cycles of hard links, so that it can use reference counts to detect when it is safe to reclaim the disk space for a file. However, you can still form cycles with soft links: a name deep down in the tree can, for example, name a directory high up in the tree. The resulting structure is no longer a directed acyclic graph, but a fully general naming network. Using soft links, a program can even invoke SYMLINK (“cycle”, “cycle”), creating a synonym for a file name that doesn’t have a file associated with it! If a process opens such a file, it will follow the link chain only a certain number of steps before reporting an error such as “Too many levels of soft links”.
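The self-referential link is easy to demonstrate; in this C sketch the resolver gives up after its fixed number of steps and open fails, on most systems with the error ELOOP:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    symlink("cycle", "cycle");              /* a name bound to itself */
    if (open("cycle", O_RDONLY) < 0 && errno == ELOOP)
        perror("open");                     /* "Too many levels of symbolic links" */
    return 0;
}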
Soft links have another interesting behavior. Suppose that the working directory is “/Scholarly/programs/www” and that this working directory contains a symbolic link named “CSE499-web” to “/Scholarly/CSE499/www”. The following calls
CHDIR (“CSE499-web”)
CHDIR (“..”)
leave the caller in “/Scholarly/CSE499” rather than back where the user started. The reason is that “..” is resolved in the new default context, “/Scholarly/CSE499/www”, rather than what might have been the intended context, “/Scholarly/programs/www”. This behavior may be desirable or not, but it is a direct consequence of the UNIX naming semantics; the Plan 9 system has a different plan,* which is also explored in exercises 3.2 and 3.3.
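A sketch of the surprise in C, assuming the directories and the link of the example exist; getcwd(3) reports the kernel’s view of the working directory (some shells hide this behavior by tracking a logical path name themselves):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    chdir("/Scholarly/programs/www");
    chdir("CSE499-web");   /* follows the soft link to /Scholarly/CSE499/www */
    chdir("..");           /* ".." resolves in the new context */
    printf("%s\n", getcwd(buf, sizeof buf));   /* prints /Scholarly/CSE499 */
    return 0;
}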
In summary, much of the power of the UNIX object naming scheme comes from its layers of naming. Table 2.3 reprises Table 2.2, this time showing the name, value, context, and pseudocode procedure used at each layer interface. (Although we have examined each of the layers in this table, the algorithms we have demonstrated have in some cases bridged across layers in ways not suggested by the table.) The general design technique has been to introduce for each problem another layer of naming, an application of the principle decouple modules with indirection.
Table 2.3 The UNIX naming layers, with details of the naming scheme of each layer.
In the process of describing how the UNIX file system is structured, we saw how it implements CHDIR, MKDIR, LINK, UNLINK, RENAME, SYMLINK, MOUNT, and UNMOUNT. We complete the description of the file system API by describing the implementation of OPEN, READ, WRITE, and CLOSE. Before describing their implementation, we describe what features they must support.
The file system allows users to control who has access to their files. The owner of a file can specify the permissions with which other users may access the file. For example, the owner may specify that other users have permission only to read a file but not to write it. OPEN must check whether the caller has the appropriate permissions. As a refinement, a file can be owned by a group of users. Chapter 11 [on-line] discusses security in detail, so we will skip the details here.
The file system records timestamps that capture the date and time of the last access, last modification to a file, and last change to a file’s inode. This information is important for programs such as incremental backup, which must determine which files have changed since the last time backup ran. The file system procedures must update these values. For example, READ updates last access time, WRITE updates last modification time and change time, and LINK updates last change time.
OPEN returns a short name for a file, called a file descriptor (fd), which READ, WRITE, and CLOSE use to name the file. Each process starts with three open files: “standard in” (file descriptor 0), “standard out” (file descriptor 1), and “standard error” (file descriptor 2). A file descriptor may name a keyboard device, a display device, or a file on disk; a program doesn’t need to know. This setup allows a designer to develop a program without having to worry about where the program’s input is coming from and where the program’s output is going to; the program just reads from file descriptor 0 and writes to file descriptor 1.
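For example, this complete C program copies whatever file descriptor 0 names to whatever file descriptor 1 names, whether those are a keyboard and display or two disk files (a stripped-down version of the idea behind the UNIX cat program):

#include <unistd.h>

int main(void)
{
    char buf[512];
    ssize_t n;
    while ((n = read(0, buf, sizeof buf)) > 0)  /* "standard in"  */
        write(1, buf, n);                       /* "standard out" */
    return 0;
}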
Several processes can use a file concurrently (e.g., several processes might write to the display device). If several processes open the same file, their READ and WRITE operations have their own file cursor for that file. If one process opens a file, and then passes the file descriptor for that file to another process, then the two processes share the cursor of the file. This latter case is common because in the UNIX system when one process (the parent) starts another process (the child), the child inherits all open file descriptors from the parent. This design allows the parent and child, for instance, to share a common output file correctly. If the child writes to the output file, for example, after the parent has written to it, the output of the child appears after the output of the parent because they share the cursor.
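A sketch of the shared cursor, using fork(2), the UNIX call by which a parent starts a child; because the child inherits the descriptor, and with it the file_table entry, its write lands after the parent’s, and the file ends up containing “parentchild”:

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, "parent", 6);       /* the shared cursor is now at offset 6 */
    if (fork() == 0) {            /* child shares the file_table entry    */
        write(fd, "child", 5);    /* starts at offset 6, not 0            */
        _exit(0);
    }
    wait(NULL);
    return 0;
}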
If one process has a file open and another process removes the last name pointing to that file, the inode isn’t freed until the first process calls CLOSE.
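This behavior is also easy to see in a C sketch: after UNLINK removes the last name, no new OPEN can reach the file, but the existing descriptor still works, and the space is reclaimed only at CLOSE:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[5];
    int fd = open("tmpfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "hello", 5);
    unlink("tmpfile");         /* removes the last name for the inode */
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 5);          /* the data is still readable */
    close(fd);                 /* now the inode and its blocks are freed */
    return 0;
}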
To support these features, the inode is extended as follows:
structure inode
integer block_numbers[N] // the numbers of the blocks that constitute the file
integer size // the size of the file in bytes
integer type // type of file: regular file, directory, symbolic link
integer refcnt // count of the number of names for this inode
integer userid // the user ID that owns this inode
integer groupid // the group ID that owns this inode
integer mode // inode’s permissions
integer atime // time of last access (READ, WRITE, … )
integer mtime // time of last modification
integer ctime // time of last change of inode
To implement OPEN, READ, WRITE, and CLOSE, the file system keeps in memory several tables: one file table (file_table) and for each process a file descriptor table (fd_table). The file table records information for the files that processes have open (i.e., files for which OPEN was successful, but for which CLOSE hasn’t been called yet). For each open file, this information includes the inode number of the file, its file cursor, and a reference count recording how many processes have the file open. The file descriptor table records for each file descriptor the index into the file table. Because a file’s cursor is stored in the file_table instead of the fd_table, children can share the cursor for an inherited file with their parent.
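In outline, the two tables might be declared as follows (a C sketch; the sizes and exact field names are assumptions for illustration, since the text does not fix them):

/* One entry per OPEN that has not yet been matched by a CLOSE. */
struct file {
    int inode_number;   /* which file this entry refers to            */
    int cursor;         /* shared by everyone who shares this entry   */
    int refcnt;         /* how many processes have this entry open    */
};

struct file file_table[100];   /* one table for the whole system           */
int fd_table[16];              /* per process: fd -> index into file_table */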
With this information, OPEN is implemented as follows:
1 procedure OPEN (character string filename, flags, mode)
2 inode_number ← PATH_TO_INODE_NUMBER (filename, wd)
3 if inode_number = FAILURE and flags = O_CREATE then // Create the file?
4 inode_number ← CREATE (filename, mode) // Yes, create it.
5 if inode_number = FAILURE then
6 return FAILURE
7 inode ← INODE_NUMBER_TO_INODE (inode_number)
8 if PERMITTED (inode, flags) then // Does this user have the required permissions?
9 file_index ← INSERT (file_table, inode_number)
10 fd ← FIND_UNUSED_ENTRY (fd_table) // Find entry in file descriptor table
11 fd_table[fd] ← file_index // Record file index for file descriptor
12 return fd // Return fd
13 else return FAILURE // No, return a failure
Line 2 finds the inode number for the file filename. If the file doesn’t exist, but the caller wants to create the file as indicated by the flag O_CREATE (line 3), OPEN calls CREATE, which allocates an inode, initializes it, and returns its inode number (line 4). If the file doesn’t exist (even after trying to create it), OPEN returns a value indicating a failure (line 6). Line 7 locates the inode. Line 8 uses the information in the inode to check if the caller has permission to open the file; the check is described in detail in Section 11.6.3.4 [on-line]. If so, line 9 creates a new entry for the inode number in the file table and sets the entry’s file cursor to zero and reference count to 1. Line 10 finds the first unused file descriptor, records its file index, and returns the file descriptor to the caller (lines 10 through 12). Otherwise, it returns a value indicating a failure (line 13).
If a process starts another process, the child process inherits the open file descriptors of the parent. That is, the information in every used entry in the parent’s fd_table is copied to the same numbered entry in the child’s fd_table. As a result, the parent and child entries in the fd_table will point to the same entry in the file_table, resulting in the cursor being shared between parent and child.
READ is implemented as follows:
1 procedure READ (fd, character array reference buf, n)
2 file_index ← fd_table[fd]
3 cursor ← file_table[file_index].cursor
4 inode ← INODE_NUMBER_TO_INODE (file_table[file_index].inode_number)
5 m = MINIMUM (inode.size – cursor, n)
6 atime of inode ← NOW ()
7 if m = 0 then return END_OF_FILE
8 for i from 0 to m – 1 do {
9 b ← INODE_NUMBER_TO_BLOCK (cursor + i, inode_number)
10 COPY (b, buf, MINIMUM (m – i, BLOCKSIZE))
11 i ← i + MINIMUM (m – i, BLOCKSIZE) }
12 file_table[file_index].cursor ← cursor + m
13 return m
Lines 2 and 3 use the file index to find the cursor for the file. Line 4 locates the inode. Lines 5 and 6 compute how many bytes READ can read and update the last access time. If there are no bytes left in the file, READ returns a value indicating end of file (line 7). Lines 8 through 11 copy the bytes from the file’s blocks into the caller’s buf. Line 12 updates the cursor, and line 13 returns the number of bytes read.
One could design a more sophisticated naming scheme for READ that, for example, allowed naming by keywords rather than by offsets. Database systems typically implement such naming schemes by representing the data as structured records that are indexed by keywords. But in order to keep its design simple, the UNIX file system restricts its representation of a file to a linear array of bytes.
The implementation of WRITE is similar to READ. The major differences are that it copies buf into the blocks of the inode, allocating new blocks as necessary, and that it updates the inode’s size and mtime.
CLOSE frees the entry in the file descriptor table and decreases the reference count in the corresponding entry in the file table. If no other processes are sharing this entry (i.e., the reference count has reached zero), it also frees the entry in the file table. If there are no other entries in the file table using this file and the reference count in the file’s inode has reached zero (because another process unlinked it), then CLOSE frees the inode.
Like RENAME, some of these operations require several disk writes to complete. If the file system fails (e.g., because the power goes off) in the middle of one of the operations, then some of the disk writes may have completed and some may not. Such a failure can cause inconsistencies among the on-disk data structures. For example, the on-disk free list may show that a block is allocated, but no on-disk inode records that block in its index. If nothing is done about this inconsistency, then that block is effectively lost. Problem set 8 explores this problem and a simple, special-case solution. Chapter 9 [on-line] explores systematic solutions.
Version 6 (like all modern implementations) maintains an in-memory cache of recently used disk blocks. When the file system needs a block, it first checks the cache for that block. If the block is present, it uses the copy in the cache; otherwise, it reads it from the storage device. With the cache, even if the file system needs to read a particular block several times, it reads that block from the storage device only once. Since reading from a disk device is often an expensive operation, the cache can improve the performance of the file system substantially. Chapter 6 discusses the implementation of caches in detail and how they can be used to improve the performance of a file system.
Similarly, to achieve high performance on operations that modify a file (e.g., WRITE), the file system will update the file’s blocks in the cache, but will not force the file’s modified inode and blocks to the storage device immediately. The file system delays the writes until later so that if a block is updated several times, it will write the block only once. Thus, it can coalesce many updates in one write (see Section 6.1.8).
If a process wants to ensure that the results of a write and inode changes are propagated to the device that stores the file system, it must call FSYNC; the UNIX specification requires that if an invocation of FSYNC for a file returns, all changes to the file must have been written to the storage device.
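A minimal sketch of the call in C; when fsync returns, the bytes written and the inode updates for “log” (a hypothetical file name) have reached the storage device:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    write(fd, "committed\n", 10);
    fsync(fd);   /* blocks until the cached blocks and inode are written */
    close(fd);
    return 0;
}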
Using the file system API, the UNIX system implements programs for users to manipulate files and name spaces. These programs include text editors (such as ed, vi, and emacs), rm (to remove a file), ls (to list a directory’s content), mkdir (to make a new directory), rmdir (to remove a directory), ln (to make link names), cd (to change the working directory), and find (to search for a file in a directory tree).
One of the more interesting UNIX programs is its command interpreter, known as the “shell”. The shell illustrates a number of other UNIX naming schemes. Say a user wants to compile the C source file named “x.c”. The UNIX convention is to overload a file name by appending a suffix indicating the type of the file, such as “.c” for C source files. (A full discussion of overloading can be found in Section 3.1.2.) The user types this command to the shell:
cc x.c
This command consists of two names: the name of a program (the compiler “cc”) and the name of a file containing source code (“x.c”) for the compiler to compile. The first thing the shell must do is find the program we want to run, “cc”. To do that, the UNIX command interpreter uses a default context reference contained in an environment variable named PATH. That environment variable contains a list of contexts (in this case directories) in which to perform a multiple lookup for the thing named “cc”. Assuming the lookup is successful, the shell launches the program, calling it with the argument “x.c”.
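In outline, the shell’s lookup might look like the following C sketch; a real shell uses the equivalent search built into execvp(3), and the helper name path_lookup here is ours:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try "dir/name" for each dir listed in PATH, in order; return 1 on success. */
int path_lookup(const char *name, char *full, size_t len)
{
    const char *p = getenv("PATH");
    if (p == NULL)
        return 0;
    char *path = strdup(p);   /* copy, because strtok modifies its argument */
    for (char *dir = strtok(path, ":"); dir != NULL; dir = strtok(NULL, ":")) {
        snprintf(full, len, "%s/%s", dir, name);
        if (access(full, X_OK) == 0) {   /* an executable "cc" in this context? */
            free(path);
            return 1;
        }
    }
    free(path);
    return 0;
}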
The first thing the compiler does is try to resolve the name “x.c”. This time it uses a different default context reference: the working directory. Once the compilation is underway, the file “x.c” may contain references to other named files, for example, statements such as
#include <stdio.h>
This statement tells the compiler to include all definitions in the file “stdio.h” in the file “x.c”. To resolve “stdio.h”, the compiler needs a context in which to resolve it. For this purpose, the compiler consults another variable (typically passed as an argument when invoking the compiler), which contains a default context to be used as a search path where include files may be found. The variables used by the shell and by the compiler each consist of a series of path names to be used as the basis for an ordered multiple lookup just as was described in Section 2.2.4.
Many other UNIX programs, such as the documentation package, man, also do multiple lookups for files using search paths found in environment variables.
The shell resolves names for commands using the PATH variable, but sometimes it is convenient to be able to say “I want to run the program located in the current working directory”. For example, a user may be developing a new version of the C compiler, which is also called “cc”. If the user types “cc”, the shell will look up the C compiler using the PATH variable and find the standard one instead of the new one in the current working directory.
For these cases, users can type the following command:
./cc x.c
which bypasses the PATH variable and invokes the program named “cc” in the current working directory (“.”).
Of course, the user could insert “.” at the beginning of the PATH variable, so that all programs in the user’s working directory would take precedence over the corresponding standard programs. That practice, however, may create some surprises. Suppose “.” is the first entry in the PATH variable, and a user issues the following command sequence to the shell:
cd /usr/potluck
ls
intending to list the contents of the directory named potluck. If that directory contained a program named ls that did something different from the standard ls command, something surprising might happen (e.g., the program named ls could remove all private files)! For this reason, it is not a good idea to include context-dependent names such as “.” or “..” in a search path. It is better to add the absolute path name of the desired directory to the front of PATH.
Another command interpreter extension is that names can be descriptive rather than simple names. For example, the descriptive name “*.c” matches all file names that end with “.c”. To provide this extension, the command interpreter transforms the single argument into a list of arguments (with the help of a more complicated lookup operation on the entries in the context) before it calls the specified command program. In the UNIX shell, users can use full-blown regular expressions in descriptive names.
As a final note, in practice, the UNIX object naming space has quite a bit of conventional structure. In particular, there are several directories with well-known names. For example, “/bin” names programs, “/etc” names configuration files, “/dev” names input/output devices, and “/usr” (rather than the root itself) names user directories. Over time these conventions have become so ingrained both in programmers’ minds and in programs that much UNIX software will not install correctly, and a UNIX wizard will become badly confused, when confronted with a system that does not follow these conventions.
For a detailed description of a more modern UNIX operating system, see the book describing the BSD operating system [Suggestions for Further Reading 1.3.4]. A descendant of the original UNIX system is Plan 9 [Suggestions for Further Reading 3.2.2], which contains a number of novel naming abstractions, some of which are finding their way back into newer UNIX implementations. A rich literature exists describing file system implementations and their trade-offs. Good starting points are the papers on FFS [Suggestions for Further Reading 6.3.2], LFS [Suggestions for Further Reading 9.3.1], and soft updates [Suggestions for Further Reading 6.3.3].
2.1 Ben Bitdiddle has accepted a job with the telephone company and has been asked to implement call forwarding. He has been pondering what to do if someone forwards calls to some number and then the owner of that number forwards calls to a third number. So far, Ben has thought of two possibilities for his implementation:
a. Follow me. Bob is going to a party at Mary’s home for the evening, so he forwards his telephone to Mary. Ann is baby-sitting for Bob, so she forwards her telephone to Bob. Jim calls Ann’s number, Bob’s telephone rings, and Ann answers it.
b. Delegation. Bob is going to a party at Mary’s home for the evening, so he forwards his telephone to Mary. Ann is gone for the week and has forwarded her telephone to Bob so that he can take her calls. Jim calls Ann’s number, Mary’s telephone rings, and Mary hands the phone to Bob to take the call.
2.1a Using the terminology of the naming section of this chapter, explain these two possibilities.
2.1b What might go wrong if Bob has already forwarded his telephone to Mary before Ann forwards her telephone to him?
2.1c The telephone company usually provides Delegation rather than Follow me. Why?
2.2 Consider the part of the file system naming hierarchy illustrated in the following:
You have been handed the following path name:
/projects/systems/exercises/Ex.2.2
and you are about to resolve the third component of that path name, the name exercises.
2.2a In the path name and in the figure, identify the context that you should use for that resolution and the context reference that allows locating that context.
2.2b Which of the terms default, explicit, built-in, per-object, and per-name apply to this context reference?
1995–2–1a
2.3 One way to speed up the resolving of names is to implement a cache that remembers recently looked-up {name, object} pairs.
2.3a What problems do synonyms pose for cache designers, as compared with caches that don’t support synonyms?
1994–2–3
2.3b Propose a way of solving the problems if every object has a unique ID.
1994–2–3a
2.4 Louis Reasoner has become concerned about the efficiency of the search rule implementation in the Eunuchs system (an emasculated version of the UNIX system). He proposes to add a referenced object table (ROT), which the system will maintain for each session of each user, set to be empty when the user logs in. Whenever the system resolves a name through use of a search path, it makes an entry in the ROT consisting of the name and the path name of that object. The “already referenced” search rule simply searches the ROT to determine if the name in question appears there. If it finds a match, then the resolver will use the associated path name from the ROT. Louis proposes to always use the “already referenced” rule first, followed by the traditional search path mechanism. He claims that the user will detect no difference, except for faster name resolution. Is Louis right?
1985–2–2
2.5 The last line of Figure 2.4 names three Web browsers as examples of interpreters. Explain how a Web browser is an interpreter by identifying its instruction reference, its repertoire, and its environment reference.
2009-0-1
Additional exercises relating to Chapter 2 can be found in the problem sets beginning on page 425.
* The WRITE operation of the memory abstraction creates a name-value association, so it can be viewed as a specialized instance of BIND. Similarly, the READ operation can be viewed as a specialized instance of RESOLVE.
* The operating system community traditionally uses the word “search” for multiple lookup, but the advent of “search engines” on both the Internet and the desktop has rendered that usage ambiguous. The last paragraph of Section 2.2.4, on page 75, discusses this topic.
* This description in terms of several parallel wires is of a structure called a parallel bus. A more thorough discussion of link communication protocols in Section 7.3 [on-line] shows how a bus can also be implemented by sending coded signals down just a few wires, a scheme called a serial bus.
* These bus addresses are chosen for convenience of the illustration. In practice, a memory module is more likely to be configured with enough bus addresses to accommodate several gigabytes.
* The name UNIX evolved from Unics, which was a word joke on Multics.
† We use “Linux” for the Linux kernel and “GNU/Linux” for the complete system, recognizing that this naming convention is not perfect either, because there are pieces of the system that are neither GNU software nor part of the kernel (e.g., the X Window System; see Sidebar 4.4).
‡ POSIX® (Portable Operating System Interface), Federal Information Processing Standards (FIPS) 151-2. FIPS 151-2 adopts ISO/IEC 9945-1: 2003 (IEEE Std. 1003.1: 2001) Information Technology-Portable Operating System Interface (POSIX)-Part 1: System Application Program Interface (API) [C Language].
* The implementation of Version 6, however, restricts the maximum number of blocks per file to 2^15.
* Rob Pike. Lexical File Names in Plan 9 or Getting Dot-Dot Right. Proceedings of the 2000 USENIX Technical Conference (2000), San Diego, pages 85–92.
3.1 Considerations in the design of naming schemes
3.1.1 Modular Sharing
3.1.2 Metadata and Name Overloading
3.1.3 Addresses: Names that Locate Objects
3.1.4 Generating Unique Names
3.1.5 Intended Audience and User-Friendly Names
3.1.6 Relative Lifetimes of Names, Values, and Bindings
3.1.7 Looking Back and Ahead: Names are a Basic System Component
3.2 Case Study: The Uniform Resource Locator (URL)
3.2.1 Surfing as a Referential Experience; Name Discovery
3.2.2 Interpretation of the URL
3.2.3 URL Case Sensitivity
3.2.4 Wrong Context References for a Partial URL
3.2.5 Name Overloading in URLs
3.3 War stories: Pathologies in the Use of Names
3.3.1 A Name Collision Eliminates Smiling Faces
3.3.2 Fragile Names from Overloading, and a Market Solution
3.3.3 More Fragile Names from Overloading, with Market Disruption
3.3.4 Case-Sensitivity in User-Friendly Names
3.3.5 Running Out of Telephone Numbers
In the previous chapter we developed an abstract model of naming schemes. When the time comes to design a practical naming scheme, many engineering considerations—constraints, additional requirements or desiderata, and environmental pressures—shape the design. One of the main ways in which users interact with a computer system is through names, and the quality of the user experience can be greatly influenced by the quality of the system’s naming schemes. Similarly, since names are the glue that connects modules, the properties of the naming schemes can significantly affect the impact of modularity on a system.
This chapter explores the engineering considerations involved in designing naming schemes. The main text introduces a wide range of naming considerations that affect modularity and usability. A case study of the World Wide Web Uniform Resource Locator (URL) illustrates both the naming model and some problems that arise in the design of naming schemes. Finally, a war stories section explores some pathological problems of real naming schemes.
We begin with a discussion of an interaction between naming and modularity.
Connecting modules by name provides great flexibility, but it introduces a hazard: the designer sometimes has to deal with preexisting names, perhaps chosen by someone else over whom the designer has no control. This hazard can arise whenever modules are designed independently. If, in order to use a module, the designer must know about and avoid the names used within that module for its components, we have failed to achieve one of the primary goals of modularity, called modular sharing. Modular sharing means that one can use a shared module by name without knowing the names of the modules it uses.
Lack of modular sharing shows up in the form of name conflict, in which for some reason two or more different values compete for the binding of the same name in the same context. Name conflict can arise when integrating two (or more) independently conceived sets of programs, sets of documents, file systems, databases, or indeed any collection of components that use the same naming scheme for internal interconnection as for integration. Name conflict can be a serious problem because fixing it requires changing some of the uses of the conflicting names. Making such changes can be awkward or difficult, for the authors of the original subsystems are not necessarily available to help locate, understand, and change the uses of the conflicting names.
The obvious way to implement modular sharing is to provide each subsystem with its own naming context, and then work out some method of cross-reference between the contexts. Getting the cross-reference to work properly turns out to be the challenge.
Consider, for example, the two sets of programs shown in Figure 3.1—a word processor and a spelling checker—each of which comprises modules linked by name and each of which has a component named INITIALIZE. The designer of the procedure WORD_PROCESSOR wants to use SPELL_CHECK as a component. If the designer tries to combine the two sets of programs by simply binding all of their names in one naming context, as in the figure (where the arrows show the binding of each name), there are two modules competing for binding of the name INITIALIZE. We have a name conflict.
Figure 3.1 Too-simple integration of two independently written sets of programs by just merging their contexts. Procedure WORD_PROCESSOR calls SPELL_CHECK, but SPELL_CHECK has a component that has the same name as a component of WORD_PROCESSOR. No single set of bindings can do the right thing.
So the designer instead tries to create a separate context for each set of programs, as in Figure 3.2. That step by itself doesn’t completely address the problem because the program interpreter now needs some rule to determine which context to use for each use of a name. Suppose, for example, it is running WORD_PROCESSOR, and it encounters the name INITIALIZE. How does it know that it should resolve this name in the context of WORD_PROCESSOR rather than the context of SPELL_CHECK?
Figure 3.2 Integration of the same two programs but using separate contexts. Having a separate context for SPELL_CHECK eliminates the name conflict, but the program interpreter now needs some basis for choosing one context over the other.
Following the naming model of Chapter 2, and the example of the e-mail system, a direct solution to this problem would be to add a binding for SPELL_CHECK in the WORD_PROCESSOR context and attach to every module an explicit context reference, as in Figure 3.3. This addition would require tinkering with the representation of the modules, an alternative that may not be convenient or even not allowed if some of the modules belong to someone else.
Figure 3.3 Modular sharing with explicit context references. The small circles added to each program module are context references that tell the name interpreter which context to use for names found in that module.
Figure 3.4 suggests another possibility: augment the program interpreter to keep track of the context in which it originally found each program. The program interpreter would use that context for resolving all names found in that program. Then, to allow the word processor to call the spell checker by name, place a binding for SPELL_CHECK in the WORD_PROCESSOR context, as shown by the solid arrow numbered 1 in that figure. (Imagine that the contexts are now file system directories.)
Figure 3.4 Integration with the help of separate contexts. Having a separate context for spell-check eliminates the name conflict, but the program interpreter still needs some basis for choosing one context over the other. Adding the solid arrow numbered 1 doesn’t quite work, but the dashed arrow numbered 2, an indirect name, does.
That extra binding creates a subtle problem that may produce a later surprise. Because the program interpreter found SPELL_CHECK in the word processor’s context, its context selection rule tells it (incorrectly) to use that context for the names it finds inside of SPELL_CHECK, so SPELL_CHECK will call the wrong version of INITIALIZE. A solution is to place an indirect name (the dashed arrow numbered 2 in Figure 3.4) in the word processor’s context, bound to the name of SPELL_CHECK in SPELL_CHECK’s own context. Then, the interpreter (assuming it keeps track of the context where it actually found each program) will correctly resolve names found in both groups of programs.
Keeping track of contexts and using indirect references (perhaps by using file system directories as contexts) is commonplace, but it is a bit ad hoc. Another, more graceful, way of attaching a context reference to an object without modifying its representation is to associate the name of an object not directly with the object itself but instead with a structure that consists of the original object plus its context reference. Some programming languages implement just such a structure for procedure definitions, known as a “closure”, which connects each procedure definition with the naming context in which it was defined. Programming languages that use static scope and closures provide a much more systematic scheme for modular sharing of named objects within the different parts of a large application program, but comparable mechanisms are rarely found* in file systems or in merging applications such as the word processing and spell-checking systems of the previous example. One reason for the difference is that a program usually contains many references to lots of named objects, so it is important to be well organized. On the other hand, merging applications involves a small number of large components with only a few cross-references, so ad hoc schemes for modular sharing may seem to suffice.
The name of an object and the context reference that should be associated with it are two examples of a class of information called metadata—information that is useful to know about an object but that cannot be found inside the object itself (or if it is inside may not be easy to find). A library bibliographic record is a collection of metadata, including title, author, publisher, publication date, date of acquisition, and shelf location of a book, all in a standard format. Libraries have a lot of experience in dealing with metadata, but failure to systematically organize metadata is a design shortcoming frequently encountered in computer systems.
Some common examples of metadata associated with an object in a computer system are a user-friendly name, a unique identifier, the type of the object (executable program, word processing text, video stream, etc.), the dates it was created, last modified, and last backed up, the location of backup copies, the name of its owner, the program that created it, a cryptographic quality checksum (known as a witness—see Sidebar 7.1 [on-line]) to verify its integrity, the list of names of who is permitted to read or update the object, and the physical location of the representation of the object. A common, though not universal, property of metadata is that it is information about an object that may be changed without changing the object itself.
One strategy for maintaining metadata in a file system is to reserve storage for the metadata in the same file system structure that keeps track of the physical location of the file and to provide methods for reading and updating the metadata. This strategy is attractive because it allows applications that do not care about the metadata to easily ignore it. Thus, a compiler can read an input file without having to explicitly identify and ignore the file owner’s name or the date on which the file was last backed up, whereas an automatic backup application can use the metadata access method to check those two fields. The UNIX file system, described in Section 2.5.1, uses this strategy by storing metadata in inodes.
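As a brief illustration of that strategy (a sketch, not from the text), a program on a UNIX system can read a file's inode metadata without reading the file itself; the file name here is invented.

    import os
    import time

    with open("report.txt", "w") as f:     # create a sample file to inspect
        f.write("example contents")

    info = os.stat("report.txt")           # reads the inode, not the contents
    print("size in bytes:", info.st_size)
    print("owner's user id:", info.st_uid)
    print("last modified:", time.ctime(info.st_mtime))

A backup application can compare st_mtime with the time of the last backup, while a compiler reading the same file never has to see these fields.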
Computer file systems nearly always provide for management of specialized metadata about each file such as its physical location, size, and access permissions, but they rarely have any provision for user-supplied metadata other than the file name. Because of this limitation, it is common to discover that file names are overloaded with metadata that has little or nothing to do with the use of the name as a reference.* The naming scheme may even impose syntax rules on allowable names to support overloading with metadata. A typical example of name overloading is a file name that ends with an extension that identifies the type of the file, such as text, word processing document, spreadsheet, binary application program, or movie. Other examples are illustrated in Figure 3.5. A physical address is another example of name overloading that is so common that the next section explores its special properties. Names that have no overloading whatever are known as pure names. The only operations it makes sense to apply to a pure name are COMPARE, RESOLVE, BIND, and UNBIND; one cannot extract metadata from it by applying a parsing operation. An overloaded name, on the other hand, can be used in two distinct ways:
Figure 3.5 Some examples of overloaded names and a pure name.
1. As an identifier, using COMPARE, RESOLVE, BIND, and UNBIND.
2. As a source from which to extract the overloaded metadata.
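A short sketch of those two uses, with an invented file name: the extension carries type metadata that a parsing operation can extract, yet the name still works as an ordinary identifier.

    import os.path

    name = "quarterly-report.xlsx"

    # 1. As an identifier: only COMPARE-style operations apply.
    print(name == "shopping-list.txt")       # False

    # 2. As a source of overloaded metadata: parse out the extension.
    base, extension = os.path.splitext(name)
    print(extension)                         # ".xlsx" suggests a spreadsheet

A pure name would support only the first use; applying a parsing operation to it would yield nothing meaningful.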
Path names are especially susceptible to overloading. Because they describe a path through a series of contexts, the temptation is to overload them with information about the route to the physical location of the object.
Overloading of a name can be harmless, but it can also lead to violation of the principles of modular design and abstraction. The problem usually shows up in the form of a fragile name. Name fragility appears, for example, when it is necessary to change the name of a file that moves to a new physical location, even though the identity and content of the file have not changed. For example, suppose that a library program that calculates square roots and that happens to be stored on disk05 is named /disk05/library/sqrt. If disk05 later becomes too full and that library has to be moved to disk06, the path name of the program changes to /disk06/library/sqrt, and someone has to track down and modify every use of the old name. Name fragility is one of the reasons that World Wide Web addresses stop working. The case study in Section 3.2 explores that problem in more detail.
The general version of this observation is that overloading creates a tension between the goal of keeping names unchanged and the need to modify the overloaded information. Typically, a module that uses a name needs the name to remain unchanged for at least as long as that module exists. For this reason, overloading must be used with caution and with understanding of how the name will be used.
Finally, in a modular system, an overloaded name may be passed through several modules before reaching the module that actually knows how to interpret the overloading. A name is said to be opaque to a module if the name has no overloading that the module knows how to interpret. A pure name can be thought of as being opaque to all modules except RESOLVE.
There are also more subtle forms of metadata overloading. Overloading can be less obvious if the user’s mind, rather than the computer system, performs the metadata extraction. For example, in the Internet host name “CityClerk.Reston.VA.US”, the identifier of the context, “Reston.VA.US”, is also recognizable as the identifier of a real place, a town named Reston, Virginia, in the United States. Each component of this name is being used to name two different real-world things: the name “Reston” identifies both a town and a table of name/value pairs that acts as a context in which the name of a municipal department may be looked up. Because it has mnemonic value, people find this reuse by overloading helpful—assuming that it is done accurately and consistently. (On the other hand, if someone names a World Wide Web service in Chicago “SaltLakeCity.net” people seeing that name are likely to assume—incorrectly—that it is actually located in Salt Lake City.)
In a computer system, an address is the name of a physical location or of a virtual location that maps to a physical location. Computer systems are constructed of real physical objects, so they abound in examples of addresses: register numbers, physical and virtual memory addresses, processor numbers, disk sector numbers, removable media volume numbers, I/O channel numbers, communication link identifiers, network attachment point addresses, pixel positions on a display—the list seems endless.
Addresses are not pure names. The thing that characterizes an address is that it is overloaded in such a way that parsing the address provides a guide to the location of the named object in some virtual or real coordinate system. As with other overloaded names, addresses can be used in two ways, in this case:
1. As an identifier.
2. As a locator, with the overloaded metadata acting as a guide to the physical location of the named object.
Thus, “Leonardo da Vinci” is an identifier that was once bound to a physical person and is now bound to the memory of that Leonardo. This identifier could have been used in comparisons to avoid confusion with Leonardo di Pisa when both of them were visiting Florence.* Today, the identifier helps avoid mixing up their writings. At the same time, “Leonardo da Vinci” is also a locator; it indicates that if you want to examine the birth record of that Leonardo, you should look in the archives of the town named Vinci.
Since access to many physical devices is geometric, addresses are often chosen from compact sets of integers in such a way that address adjacency corresponds to physical adjacency, and arithmetic operations such as “add 1” or subtracting one address from another have a useful, physical meaning. For example, a seek arm finds track #1079 on a magnetic disk by counting the number of tracks it passes, and a disk arm scheduler looks at differences in track addresses to decide the best order in which to perform seeks. For another example, a memory chip contains an array of bits, each of which has a unique integer address. When a read or write request for a particular address arrives at the chip, the chip routes individual bits of that address to selectors that guide the flow of information to and from the intended bit of storage.
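The sketch below (with invented track numbers and a deliberately simple shortest-seek-first rule, not any particular scheduler from the text) shows that arithmetic in action: subtracting one track address from another yields a physical seek distance.

    requests = [1079, 83, 512, 977]          # pending track addresses
    arm_position = 500

    order = []
    while requests:
        # the arithmetic difference of two addresses is a seek distance
        nearest = min(requests, key=lambda t: abs(t - arm_position))
        order.append(nearest)
        requests.remove(nearest)
        arm_position = nearest

    print(order)                             # [512, 83, 977, 1079]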
Sometimes it is inappropriate to apply arithmetic operations to addresses, even when they are chosen from compact sets of integers. For example, telephone numbers (known technically as “directory numbers”) are integers that are overloaded with routing information in their area and exchange codes, but there is no necessary physical adjacency of two area codes that have consecutive addresses. Similarly, there is no necessary physical adjacency of two telephones that have consecutive directory numbers. (In decades past, there was physical adjacency of consecutive directory numbers inside the telephone switching equipment, but that adjacency was so constraining that it was abandoned by introducing a layer of indirection as part of the telephone switch gear.)
The overloaded location information found in addresses can cause name fragility. When an object moves, its address, and thus its name, changes. For this reason, system designers usually follow the example of telephone switching systems: they apply the design principle decouple modules with indirection to hide addresses. Adding a layer of indirection provides a binding from some externally visible, but stable, name to an address that can easily be changed when the object moves to a new location. Ideally, addresses never need to be exposed above the layer of interpretation that directly manipulates the objects. Thus, for example, the user of a personal computer that has a communication port may be able to write programs using a name such as COM1 for the port, rather than a hexadecimal address such as 4D7C hex, which may change to 4D7E hex when the port card is replaced.
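A minimal sketch of that principle, with invented names and addresses: programs resolve the stable name through a binding table, so replacing the port card changes one binding rather than every program.

    port_table = {"COM1": 0x4D7C}        # stable name -> current address

    def resolve_port(name):
        return port_table[name]

    print(hex(resolve_port("COM1")))     # programs mention only "COM1"

    port_table["COM1"] = 0x4D7E          # port card replaced: rebind once
    print(hex(resolve_port("COM1")))     # same name, new address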
When a name must be changed because it is being used as an address that is not hidden by a layer of indirection, things become more complicated and they may start to go wrong. At least four alternatives have been used in naming schemes:
Search for and change all uses of the old address. At best, this alternative is a nuisance. In a large or geographically distributed system, it can be quite painful. The search typically misses some uses of the name, and those users, on their next attempted use of the name, either receive a puzzling not-found response for an object that still exists or, worse, discover that the old address now leads to a different object. For that reason, this scheme may be combined with the next one.
Plan that users of the name must undertake an attribute-based search for the object if they receive a not-found response or detect that the address has been rebound to a different object. If the search finds the correct object, its new address can replace the old one, at least for that user. A different user will have to do another search.
If the naming scheme provides either synonyms or indirect names, add bindings so that both the old and new addresses continue to identify the object. If addresses are scarce and must be reused, this alternative is not attractive.
If the name is bound to an active agent, such as a post office service that accepts mail, place an active intermediary, such as a mail forwarder, at the old address.
None of these alternatives may be attractive. The better method is nearly always for the designer to hide addresses behind a layer of indirection. Section 3.3.2 provides an example of this problem and the solution using indirection. Exercise 2.1 explores some interesting indirection-related naming problems in the telephone system related to the feature known as call forwarding.
One might suggest avoiding the name fragility problem by using only pure names, that is, names with no overloading. The trouble with that approach is that it makes it difficult to locate the object. When the lowest-layer name carries no overloaded addressing metadata, the only way to resolve that name to a physical object is by searching through an enumeration of all the names. If the context is small and local, that technique may be acceptable. If the context is universal and widely distributed, name resolution becomes quite problematic. Consider, for example, the problem of locating a railway car, given only a unique serial number painted on its side. If for some reason you know that the car is on a particular siding, searching may be straightforward, but if the car can be anywhere on the continent, searching is a daunting prospect.
In a unique identifier name space, some protocol is needed to ensure that all of the names actually are unique. The usual approach is for the naming scheme to generate a name for a newly created object, rather than relying on the creator to propose a unique name. One simple scheme for generating unique names is to dole out consecutive integers or sufficiently fine timestamp values. Sidebar 3.1 shows an example. Another scheme for generating unique names is to choose them at random from a sufficiently large name space. The idea is to make the probability of accidentally choosing the same name twice (a form of name conflict called a collision) negligibly small. The trouble with this scheme is that it is hard for a finite-state machine to create genuine randomness, so the chance of accidentally creating a name collision may be much higher than one would predict from the size of the name space. One must apply careful design, for example, by using a high-quality pseudorandom number generator and seeding it with a unique input such as a timestamp that was created when the system started. An example of such a design is the naming system used inside the Apollo DOMAIN operating system, which provided unique identifiers for all objects across a local-area network to provide a high degree of transparency to users of the system; for more detail, see Suggestions for Further Reading 3.2.1.
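Here is a hedged sketch of the two generation schemes just described; the bit width and the seeding choice are illustrative, not a recipe.

    import random
    import time

    # Scheme 1: dole out consecutive integers.
    counter = 0
    def next_serial_name():
        global counter
        counter += 1
        return counter

    # Scheme 2: draw names at random from a large space, seeding the
    # pseudorandom generator with a unique startup timestamp so that two
    # machines do not produce the same sequence.
    rng = random.Random(time.time_ns())
    def random_name(bits=128):
        return rng.getrandbits(bits)

    print(next_serial_name(), hex(random_name()))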
Sidebar 3.1 Generating a Unique Name from a Timestamp
Some banking systems generate a unique character-string name for each transaction. A typical name generation scheme is to read a digital clock to obtain a timestamp and convert the timestamp to a character string. A typical timestamp might contain the number of microseconds since January 1, 2000. A 50-bit timestamp would repeat after about 35 years, which may be sufficient for the bank’s purpose. Suppose the timestamp at 1:35 P.M. on April 1, 2007, is
00010111110110101101001100111001100010111010011001
To convert this string of bits to a character string, divide it into five-bit chunks and interpret each chunk as an index into a table of 32 alphanumeric characters. The five-bit chunks are:
00010–11111–01101–01101–00110–01110–01100–01011–10100–11001
Next, reinterpret the chunks as index numbers:
2 31 13 13 6 14 12 11 20 25
Then look those numbers up in this table of 32 alphanumeric characters:
The result is the 10-character unique name “D9RRJ-UQTYP”. You may have seen similar unique names in transactions performed with an on-line banking system.
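The recipe is easy to express in code. In the sketch below, the 32-character table is invented because the sidebar's own table is not reproduced here, so the output will differ from "D9RRJ-UQTYP" even though the chunking and lookup steps are the same.

    TABLE = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"   # hypothetical 32 characters

    def timestamp_to_name(bits):
        # split into 5-bit chunks, read each as an index, look it up
        chunks = [bits[i:i + 5] for i in range(0, len(bits), 5)]
        return "".join(TABLE[int(chunk, 2)] for chunk in chunks)

    bits = "00010111110110101101001100111001100010111010011001"
    print(timestamp_to_name(bits))               # a 10-character name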
Yet another way to avoid generated name collisions, for an object that has a binary representation and that already exists when it is being named, is to choose as its unique name the contents of the object. This approach assigns two objects with the same content the same name. In some applications, however, that may be a feature—it provides a way of discovering the existence of unwanted duplicate copies. That name is likely to be fairly long, so a more practical approach is to use as the name a shorter version of its contents, known as a hash. For example, one might run the contents of a stored file through a cryptographic transformation function whose output is a bit string of modest, fixed length, and use that bit string as the name. One version of the Secure Hash Algorithm (SHA, described in Sidebar 11.8 [on-line]) produces, for any size of input, an output that is 160 bits in length. If the transforming function is of sufficiently high quality, two different files will almost certainly end up with different names.
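A brief sketch of content-derived naming using Python's standard library: SHA-1, one member of the Secure Hash Algorithm family, yields the 160-bit output mentioned above, and identical contents always yield identical names.

    import hashlib

    def content_name(data):
        return hashlib.sha1(data).hexdigest()    # 160 bits as 40 hex digits

    print(content_name(b"to be or not to be"))
    print(content_name(b"to be or not to be"))   # same contents, same name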
The main problem with any naming scheme that is based on the contents of the named object is that the name is overloaded. When someone modifies an object whose name was constructed from its original contents, the question that arises is whether to change its name. This question does not come up in preservation storage systems that do not allow objects to be modified, so hash-generated unique names are sometimes used in those systems.
Unique identifiers and generated names can also be used in places other than unique identifier name spaces. For example, when a program needs a name for a temporary file, it may assign a generated name and place the file in the user’s working directory. In this case, the design challenge for the name generator is to come up with an algorithm that will not collide with already existing names chosen by people or generated by other automated name generators. Section 3.3.1 gives an example of a system that failed to meet this challenge.
Providing unique names in a large, geographically distributed system requires careful design. One approach is to create a hierarchical naming scheme. This idea takes advantage of an important feature of hierarchy: delegation. For example, a goal of the Internet is to allow creation of several hundred million different, unique names in a universal name space for attachment points for computers. If one tried to meet that goal by having someone at the International Telecommunications Union coordinating name assignment, the immense number of name assignments would almost certainly lead to long delays as well as mistakes in the form of accidental name collisions. Instead, some central authority assigns the name “edu” or “uk” and delegates the responsibility for naming things ending with that suffix to someone else—in the case of “edu”, a specialist in assigning university names. That specialist accepts requests from educational institutions and, for example, assigns the name “pedantic” and thereby delegates the responsibility for names ending with the suffix “.pedantic.edu” to the Pedantic University network staff. That staff assigns the name “cse” to the Computer Science and Engineering Department, further delegating responsibility for names ending with the suffix “.cse.pedantic.edu” to someone in that department. The network manager inside the department can, with the help of a list posted on the wall or a small on-line database, assign a name such as “ginger” that is locally unique and at the same time can be confident that the fully qualified name “ginger.cse.pedantic.edu” is also globally unique.
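A toy model of that delegation, with the contexts represented as nested tables and an invented value at the leaf: each authority manages only its own table, and global uniqueness follows from local uniqueness at every level.

    root = {"edu": {"pedantic": {"cse": {"ginger": "network address"}}}}

    def resolve(qualified_name, context):
        # resolve right to left: each component is looked up in the
        # context that the previous component delegated to
        for part in reversed(qualified_name.split(".")):
            context = context[part]
        return context

    print(resolve("ginger.cse.pedantic.edu", root))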
A different example of a unique identifier name space is the addressing plan for the commercial Ethernet. Every Ethernet interface has a unique 48-bit media access control (MAC) address, typically set into the hardware by the manufacturer. To allow this assignment to be made uniquely, but without a single central registry for the whole world, there is a shallow hierarchy of MAC addresses. A standards-setting authority allocates to each Ethernet interface manufacturer a block of MAC addresses, all of which start with the same prefix. The manufacturer is then free to allocate MAC addresses within that block in any way that is convenient. If a manufacturer uses up all the MAC addresses in a block, it applies to the central authority for another block, which may have a prefix that has no relation to the previous prefix used by that same manufacturer.
One consequence of this strategy, especially noticeable in a large network, is that the MAC address of an Ethernet interface does not provide any overloading information that is useful for physically locating the interface card. Even though the MAC address is assigned hierarchically, the hierarchy is used only to delegate and thus decentralize address assignment, and it has no assured relation to any property (such as the physical place where the card attaches to the network) that would help locate it. Just as in locating a railway car knowing only its unique identifier, resolving a MAC address to the particular physical device that carries it is difficult unless one already has a good idea where to start looking.
People struggling to figure out how to tie a software license to a particular computer sometimes propose to associate the license with the Ethernet MAC address of that computer because that address is globally unique. Apart from the problem that some computers have no Ethernet interface and others have more than one, a trouble with this approach is that if an Ethernet interface card on the computer fails and needs to be replaced, the new card will have a different MAC address, even though the location of the system, the software, and its owner are unchanged. Furthermore, if the card that failed is later repaired and reinstalled in another system, that other system will now have the MAC address that was previously associated with the first system. The MAC address is thus properly viewed only as the unique name of a specific hardware component, not of the system in which it is embedded.
Deciding what constitutes the unique identity of a system that is constructed of replaceable components is ultimately a convention that requires an arbitrary choice by the designer of the naming scheme. This choice is similar to the question of establishing the identity of wooden ships. If, over the course of 300 years, every piece of wood in the ship has been replaced, is it still the same ship? Apparently, ship registries say “yes”. They do not associate the name of the ship with any single component; the name is instead associated with the ship as a whole. Answering this identity question can clarify which of the three meanings of the COMPARE operation that was discussed in Section 2.2.5 is most appropriate for a particular design.
Some naming schemes are intended to be used by people. Names in such a name space are typically user-chosen and user-friendly strings of characters with mnemonic value such as “economics report”, “shopping list”, or “Joe.Smith” and are widely used as names of files and e-mailboxes. Ambiguity (that is, non-uniqueness) in resolving user-friendly names may be acceptable because in interactive systems the person using the name can be asked to resolve the ambiguity.
Other naming schemes are intended primarily for use by machines. In these schemes, the names need not have mnemonic value, so they are typically integers, often of fixed width designed to fit into a register and to allow fast and unambiguous resolution. Memory addresses and disk sector addresses are examples. Sometimes the term identifier is used for a name that is not intended to be intelligible to people, but this usage is by no means universal. Names intended for use by machines are usually chosen mechanically.
When a name is intended to be user-friendly, a tension arises between a need for it to be a unique, easily resolvable identifier and a need to respect other, non-technical values such as being easy to remember or being the same as some existing place or personal name. This tension may be resolved by maintaining a second, machine-oriented identifier, in addition to the user-friendly name—thus billing systems for large companies usually have both an account name and an account number. The second identifier can be unique and thus resolve ambiguities and avoid problems related to overloading of the account name. For example, personal names are usually overloaded with family history metadata (such as the surname, a given middle name that is the same as a mother’s surname, or an appended “Jr.” or “III”), and they are frequently not unique. Proposals to require that personal names be chosen uniquely always founder on cultural and personal identity objections. To avoid these problems, most systems that maintain personal records assign distinct unique identifiers to people, and include both the user-friendly name and the unique identifier in their metadata.
Another example of tension in the choice of user-friendly names is found in the use of capital and small letters. Up through the mid-1960s, computer systems used only capital letters, and printed computer output always seemed to be shouting. There were a few terminals and printers that had lower-case letters, but one had to write a device-dependent application to make use of that feature, just as today one has to write a device-dependent application to use a virtual reality helmet. In 1965, the designers of the Multics time-sharing system introduced lower-case alphabetics to names of the file system. This being the first time anyone had tried it, they got it wrong. The designers of the UNIX file system copied the mistake. In turn, many modern file systems copy the UNIX design in order to avoid changing a widely used interface. The mistake is that the names “Court docket 5” and “Court Docket 5” can be bound to different files. The resulting violation of the principle of least astonishment can lead to significant confusion, since the computer rigidly enforces a distinction that most people are accustomed to overlooking on paper. Systems that enforce this distinction are called case-sensitive.
A more user-friendly way to allow upper- and lower-case letters in names is to permit the user to specify a preferred combination of upper- and lower-case letters for storage and display of a name, but coerce all alphabetic characters to the same case when doing name comparisons. Thus, when another person types the name, the case does not have to precisely match the display form. Systems that operate this way are called case-preserving. Both the Internet Domain Name System (described in Section 4.4) and the Macintosh file system provide this more user-friendly naming interface. A less satisfactory way to reduce case confusion is case-coercing, in which all names are both coerced to and stored in one case. A case-coercing system constrains the appearance of names in a way that can interfere with good human engineering.
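The three policies differ only in how they canonicalize names, as this sketch with invented file names suggests: a case-preserving context stores the display form but compares a folded form, a case-sensitive one would use the name itself as the key, and a case-coercing one would store the folded form as the display form too.

    bindings = {}

    def bind_preserving(name, value):
        bindings[name.casefold()] = (name, value)   # remember display form

    def lookup_preserving(name):
        display, value = bindings[name.casefold()]  # compare case-blind
        return display, value

    bind_preserving("Court Docket 5", "contents of the docket")
    print(lookup_preserving("court docket 5"))      # found; displays original form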
The case studies in Section 3.2 and the war stories in Section 3.3 describe some unusual results when a system design mixes case-sensitive and case-preserving naming systems.
User-friendly names are not always strings of characters. In a graphical user interface (GUI), the shape (and sometimes the position) of an icon on the display is an identifier that acts exactly like a name, even if a character string is not associated with it. What action the system undertakes when the user clicks the mouse depends on where the mouse cursor was at that instant, and in a video game the action may depend on what else is happening at the same time. The identifier is thus bound to a time and a position on the screen, and that combination of values is in turn an identifier that is bound to some action.
Another, similar example of a user-friendly name that does not take the form of a string of characters is the cross-linking system developed by the M.I.T. Shakespeare Project. In that system, hypertext links say where they are coming from rather than where they are going to. Resolution starts by looking up the identifier of the place where the link was found. The principle is identical to that of the GUI/mouse example, and the system is described in Sidebar 3.2.
Sidebar 3.2 Hypertext Links in the Shakespeare Electronic Archive
There are many representations of all of Shakespeare’s plays: a modern text, the sixteenth-century folios, and several movies. In addition, a huge amount of metadata is available about each play: commentaries, stage directions, photographs and sketches of sets, directors’ notes, and so on. In the study of a play, it would be helpful if these various representations could be linked together, so that, for example, if one were interested in the line “Alas, poor Yorick! I knew him, Horatio” from Hamlet, one could quickly check the wording in the several editions, compare different movie clips of the presentation of that line, and examine commentaries and stage directions that pertain to that line.
The M.I.T. Shakespeare Project has developed a system intended to make this kind of cross-reference easy. The basic scheme is first to assign a line number to every line in the play and then index every representation of the play by line number. A user displays one representation, for example, the text of a modern edition, and selects a line. Because the edition is indexed by line number, that selection is a reference that is bound to the line number. The user then clicks on the selection, causing the system to look up the associated line number in one of several contexts, each context corresponding to one of the other representations. The user selects a context, and the system can immediately resolve the line number in that context and display that representation in a different window on the user’s screen.
If names must be chosen from a name space of short, fixed-length strings of bits or characters, they are by nature limited in number. The designer may permanently bind the names of a limited name space, as in the case of the registers of a simple processor, which may, for example, run from zero to 31. If the names of a limited name space can be dynamically bound, they must be reused. Therefore, the naming scheme usually replaces the BIND and UNBIND operations with some kind of name allocation/deallocation procedure. In addition, the naming scheme for a limited name space typically assigns the names, rather than letting the user choose them. On the other hand, if the name space is unlimited, meaning that it does not significantly constrain name lengths, it is usually possible to allow the user to choose arbitrary names. Thus, the telephone system in North America uses a naming scheme with short, fixed-length names such as 208–555–0175 for telephone numbers, and the telephone company nearly always assigns the numbers. (Section 3.3.5 describes some of the resulting problems.) On the other hand, names in most modern computer file systems are for practical purposes unlimited, and the user gets to choose them.
A naming scheme, a name, the binding of that name to a value, and the value to which the name is bound can all have different lifetimes. Often, both names and values are themselves quite long-lived, but the bindings that relate one to the other are somewhat more transient. Thus, both personal names and telephone numbers are typically long-lived, but when a person moves to a different city, the telephone company will usually bind that personal name to a new telephone number and, after some delay, bind a new personal name to the old telephone number. In the same way, an application program and the operating system interfaces it uses may both be long-lived, but the binding that connects them may be established anew every time the program runs. Renewing the bindings each time the program is launched makes it possible to update the application program and the operating system independently. For another example, a named network service, such as PostOffice.gov, and a network attachment point, such as the Internet address 10.72.43.131, may both be long-lived, but the binding between them may change when the Post Office discovers that it needs to move that service to a different, more reliable computer, and it reassigns the old computer to a less important service.
When a name outlives its binding, any user of that name that still tries to resolve it will encounter a dangling reference, which is a use of a previously bound name that resolves either to a not-found result or to an irrelevant value. Thus, an old telephone number that rings in the wrong house or leads to a message saying “that number has been disconnected” is an example of a dangling reference. Dangling references are nearly always a concern when the name space is limited because names from limited name spaces must be reused. An object that incorrectly uses old names may make serious mistakes and even cause damage to an unrelated object that now has that name (for example, if the name is a physical memory address). In some cases, it may be possible to deal with dangling references by considering names to be simply hints that require verification. Thus when looking up the telephone number of a long-lost friend in a distant city, the first question when someone answers the phone at that number is something such as “are you the James Wilson who attended high school in … ?”
When a name space is unlimited and names are never reused, dangling references affect only the users of names that have for some reason been unbound from their former values. These dangling references can be less disruptive. For example, in a file system, an indirect name is one that is bound to some other (target) file system name. The indirect name becomes a dangling reference if someone removes the target name. Because an unbound indirect name simply produces a not-found result, it is more likely to be a nuisance than a source of damage. However, if someone accidentally or maliciously reuses the target name for a completely different file, the user of the indirect name could be in for a surprise.
When systems are large or distributed, however, a name, once bound and exported, tends to be discovered and remembered in widely dispersed places. That dispersion creates a need for stable bindings. This effect has been particularly noticed in the World Wide Web, whose design encourages the creation of cross-references to documents whose names are under someone else’s control, with the result that cross-references often evolve into dangling references.
There is a converse to the dangling reference: when an object outlives every binding of a name to it, that object becomes what is known as an orphan or a lost object because no one can ever refer to it by name again. Lost objects can be a serious problem because there may be no good way to reclaim the physical storage they occupy. A system that regularly loses track of objects in this way is said to have a storage leak. To avoid lost objects, some naming schemes keep track of the number of bindings to each object, and, when an UNBIND operation causes that number to reach zero, the system takes the opportunity to reclaim the storage occupied by the object. We saw this reference counting scheme used for links in the case study in Section 2.5. It contrasts with tracing garbage collection, an alternative technique used in some programming languages that involves occasional exploration of the named connections among objects to see which objects can and cannot be reached. The UNIX file system, described in Section 2.5, uses reference counting for file objects.
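A minimal sketch of reference counting in the spirit of those link counts, with invented names: UNBIND decrements the count, and the object's storage is reclaimed when the last binding disappears, so no lost object accumulates.

    storage = {"obj42": "file contents"}             # object -> representation
    bindings = {"a.txt": "obj42", "b.txt": "obj42"}  # two names, one object
    refcount = {"obj42": 2}

    def unbind(name):
        obj = bindings.pop(name)
        refcount[obj] -= 1
        if refcount[obj] == 0:                       # last binding gone:
            del storage[obj]                         # reclaim the storage

    unbind("a.txt")
    print("obj42" in storage)                        # True: one binding remains
    unbind("b.txt")
    print("obj42" in storage)                        # False: storage reclaimed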
In this and the previous chapter, we have explored both the underlying principles of, and many engineering considerations surrounding, the use of names, but we have only lightly touched on the applications of names in systems. Names are a fundamental building block in all system areas. Looking ahead, almost every chapter will develop techniques and methods that depend on the use of names, name spaces, and binding:
In modularizing systems with clients and services (Chapter 4), clients need a way to name services.
In modularizing systems with virtualization (Chapter 5), virtual memory is an address naming system.
In enhancing performance (Chapter 6), caches are renaming devices.
Data communication networks (Chapter 7 [on-line]) use names to identify nodes and to route data to them.
In transactions (Chapter 9 [on-line]) it is frequently necessary to modify several distinct objects “at the same time”, meaning that all the changes appear to happen in a single program step, an example of atomicity. One way to obtain this form of atomicity is by temporarily grouping copies of all the objects that are to be changed into a composite object that has a temporary, hidden name, modifying the copies, and then rebinding the composite object to a visible name. In this way, all of the changed components are revealed simultaneously.
In security (Chapter 11 [on-line]), designers use keys, which are names chosen randomly from a very large and sparsely populated address space. The underlying idea is that if the only way to ask for something is by name, and you don’t know and can’t guess its name, you can’t ask for it, so it is protected.
Name discovery, which was introduced in the preceding chapter, will reappear when we discuss information protection and security. When one user either tries to identify or grant permission to another named user, it is essential to know the authentic name of that other user. If someone can trick you into using the wrong name, you may grant permission to a user who shouldn’t have it. That requirement in turn means that one needs to be able to trace the name discovery procedure back to some terminating direct communication step, verify that the direct communication took place in a credible fashion (such as examining a driver’s license), and also evaluate the amount of trust to place in each of the other steps in the recursive name discovery protocol. Chapter 11 [on-line] describes this concern as the name-to-key binding problem.
Discovery of user names is one example in which authenticity is clearly of concern, but a similar authenticity concern can apply to any name binding, especially in systems shared by many users or attached to a network. If anyone can tinker with a binding, a user of that binding may make a mistake, such as sending something confidential to a hostile party. Chapter 11 [on-line] addresses in depth techniques of achieving authenticity. The User Internet Architecture research project uses such techniques to provide a secure, global naming system for mobile devices based on physical rendezvous and the trust found in social networks. For more detail, see Suggestions for Further Reading 3.2.5.
There is also a relation between uniqueness of names and security: If someone can trick you into using the same supposedly unique name for two different things, you may make a mistake that compromises security. The Host Identity Protocol addresses this problem by creating a name space of Internet hosts that is protected by cryptographic techniques similar to those described in Chapter 11 [on-line]. For more detail, see Internet Engineering Task Force Request for Comments RFC 4423.
This look ahead completes our introduction of concepts related to the design of naming systems. The next two sections of this chapter provide a case study of the relatively complex naming scheme used for pages of the World Wide Web, and a collection of war stories that illustrate what can go wrong when naming concepts fail to receive sufficient design consideration.
The World Wide Web [see Suggestions for Further Reading 3.2.3] is a naming network with no unique root, potentially many different names for the same object, and complex context references. Its name-mapping algorithm is a conglomeration of several different component name-mapping algorithms. Let’s fit it into the naming model.*
The Web has two layers of naming: an upper layer that is user-friendly, and a lower layer, which is also based on character strings, but is nevertheless substantially more mechanical.
At the upper layer, a Web page looks like any other page of illustrated text, except that one may notice what seem to be an unusually large number of underlined words, for example, Alice’s page. These underlined pieces of text, as well as certain icons and regions within graphics, are labels for hyperlinks to other Web pages. If you click on a hyperlink, the browser will retrieve and display that other Web page. That is the user’s view. The browser’s view of a hyperlink is that it is a string in the current Web page written in HyperText Markup Language (HTML). Here is an example of a text hyperlink:
<a href="http://web.pedantic.edu/Alice/www/home.html">Alice’s page</a>
Nestled inside this hyperlink, between the quotation marks, is a Uniform Resource Locator or, in Webspeak, a URL, which in the example is the name of another Web page at the lower naming layer. We can think of a hyperlink as binding a name (the underlined label) to a value (the URL) that is itself a name in URL name space. Since a context is a set of bindings of names to values, any page that contains hyperlinks can be thought of as a context, albeit not of the simple table-lookup variety. Instead, the name-mapping algorithm is one carried on in the mind of the user, matching ideas and concepts to the various hyperlink labels, icons, and graphics. The user does not usually traverse this naming network by typing path names, but rather by clicking on selected objects. In this naming network, a URL plays the role of a context reference for the links in the page fetched by the URL.
In order to retrieve a page in the World Wide Web, you need its URL. Many URLs can be found in hyperlinks on other Web pages, which helps if you happen to know the URL of one of those Web pages, but somewhere there must be a starting place. Most Web browsers come with one or more built-in Web pages that contain the URL of the browser maker plus a few other useful starting points in the Web. This is one way to get started on name discovery. Another form of name discovery is to see a URL mentioned in a newspaper advertisement.
In the example hyperlink above, we have an absolute URL, which means that the URL carries its own complete, explicit context reference:
http://web.pedantic.edu/Alice/www/home.html
The name-mapping algorithm for a URL works in several steps, as follows.
1. The browser extracts the part before the colon (here, http), considers it to be the name of a network protocol to use, and resolves that name to a protocol handler using a table-lookup context stored in the browser. The name of that context is built in to the browser. The interpretation of the rest of the URL depends on the protocol handler. The remaining steps describe the interpretation for the case of the hypertext transfer protocol (http) handler.
2. The browser takes the part between the // and the following / (in our example, that would be web.pedantic.edu) and asks the Internet Domain Name System (DNS) to resolve it. The value that DNS returns is an Internet address. Section 4.4 is a case study of DNS that describes in detail how this resolution works.
3. The browser opens a connection to the server at that Internet address, using the protocol found in step 1, and as one of the first steps of that protocol it sends the remaining part of the URL, /Alice/www/home.html, to the server.
4. The server looks for a file in its file system that has that path name.
5. If the name resolution of step 4 is successful, the server sends the file with that path name to the client. The client transforms the file into a page suitable for display.
(Some Web servers perform additional name resolution steps. The discussion in Section 3.3.4 describes an example.)
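The first steps of the algorithm can be sketched with Python's standard library; web.pedantic.edu is the text's fictional host, so the actual DNS call appears only as a comment.

    from urllib.parse import urlsplit

    url = "http://web.pedantic.edu/Alice/www/home.html"
    parts = urlsplit(url)

    print(parts.scheme)   # "http": resolved in the browser's protocol table (step 1)
    print(parts.netloc)   # "web.pedantic.edu": the part handed to DNS (step 2)
    print(parts.path)     # "/Alice/www/home.html": sent to the server (step 3)

    # Step 2 itself, for a real host name, would be:
    #     import socket
    #     address = socket.gethostbyname(parts.netloc)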
The page sent by the server might contain a hyperlink of its own such as the following:
<a href="contacts.html">如何联系 Alice。</a>
<a href="contacts.html">How to contact Alice.</a>
In this case the URL (again, the part between the quotation marks) does not carry its own context. This abbreviated URL is called a relative or partial URL. The browser has been asked to interpret this name, and in order to proceed it must supply a default context. The URL specification says to derive a context from the URL of the page in which the browser found this hyperlink, assuming somewhat plausibly that this hypertext link should be interpreted in the same context as the page in which it was found. Thus it takes the original URL and replaces its last component (home.html) with the partial URL, obtaining
http://web.pedantic.edu/Alice/www/contacts.html
It then performs the standard name-mapping algorithm on this newly fabricated absolute URL, and it should expect to find the desired page in Alice’s www directory.
A page can override this default context by providing something called a base element (e.g., <base href="some absolute URL">). The absolute URL in the base element is a context reference to use in resolving any partial URL found on the page that contains the base element.
Multiple naming schemes are involved in the Web naming algorithm, as is clear by noticing that some parts of a URL are case sensitive and other parts are not. The result can be quite puzzling. The host name part of a Uniform Resource Locator (URL) is interpreted by the Internet Domain Name System, which is case-insensitive. The rest of the URL is a different matter. The protocol name part is interpreted by the client browser, and case sensitivity depends on its implementation. (Check to see if a URL starting with "HTTP://" works with your favorite Web browser.) The Macintosh implementation of Firefox treats the protocol name in a case-preserving fashion, but the now-obsolete Macintosh implementation of Internet Explorer is case-coercing.
The more interesting case-sensitivity questions come after the host name. The Web specifies that the server should interpret this part of the URL using a scheme that depends on the protocol. In the case of the HTTP protocol, the URL specification is insistent that this string is not a UNIX file name, but it is silent on case sensitivity. In practice, most systems interpret this string as a path name in their file system, so case sensitivity depends on the file system of the server. Thus if the server is running a standard UNIX system, the path name is case-sensitive, while if the server is a standard Macintosh, the path name is case-preserving. There are examples that mix things up even further; Section 3.3.4 describes one such example.
The practice of interpreting URL path names as path names in the server’s file system can lead to surprises. As described earlier, the Web browser supplies a default context reference for relative names (that is, partial URLs) found in Web pages. The default context reference it supplies is simply the URL that the browser used to retrieve the page containing the relative name, truncated back to the last slash character. This context reference is the name of a directory at the server that should be used to resolve (the first component of) the relative name.
Some servers provide a URL name space by simply using the local (for example, UNIX) file system name space. When the local file system name space allows synonyms for directory names (symbolic links and the Network File System mounts described in Section 4.5 are two examples), the mapping of the local file system name space to the URL name space is not unique. Thus, several URLs with different path names can refer to the same object. For example, suppose that there is a UNIX file system with a symbolic link named /alice/home.html that is actually an indirect reference to the file named /alice/www/home.html. In that case, the URLs
1 <http://web.pedantic.edu/alice/home.html>
and
2 <http://web.pedantic.edu/alice/www/home.html>
refer to the same file. Trouble can arise when the object that has multiple URLs is a directory whose name is used as a context reference. Continuing the example, suppose that the file home.html contains the hyperlink <a href="contacts.html">. Both home.html and contacts.html are stored in the directory /alice/www. Suppose further that the browser obtained home.html by using URL 1 above.
Now the user clicks on the hyperlink containing the partial URL contacts.html, asking the browser to resolve it. Following the usual procedure, the browser materializes a default context reference by truncating the original URL to obtain:
http://web.pedantic.edu/alice/
and then uses this name as a context by concatenating the partial URL:
http://web.pedantic.edu/alice/contacts.html
This URL will probably produce a not-found response because the file we are looking for actually has the path name /alice/www/contacts.html. Or worse, the request could return a different file, one in the directory /alice that happens to be named contacts.html. The confusion is compounded if that different file turns out to be an out-of-date copy of the current contacts.html. On the other hand, if the user had originally used URL 2, the browser would retrieve the file named /alice/www/contacts.html, as the Web page designer expected.
A similar problem can arise when interpreting the relative name “..”. This name is, conventionally, the name for the parent directory of the current directory. The UNIX system provides a semantic interpretation: look up the name “..” in the current directory, where by convention it evaluates (in inode name space) to the parent directory. In contrast, the Web specifies that “..” is a syntactic signal that means “modify the default context reference by discarding the least significant component of the path name.” Despite these drastically different interpretations of “..”, the result is usually the same because the parent of an object is usually the thing named by the next-earlier component of that object’s path name. The exception (and the problem) arises when the Web’s syntactic modification rule is applied to a path name with a component that is an indirect name for a directory. If the path name in the URL does not traverse the directory’s parent, syntactic interpretation of “..” creates a default context reference different from the one that would be supplied by semantic interpretation.
Suppose, in our example, that the file home.html contains the hyperlink <a href="../phone.html">. If the user who reached home.html via URL 1 clicks on this hyperlink, the browser will truncate that URL and concatenate it with the partial URL, to obtain
http://web.pedantic.edu/alice/../phone.html
and then use the syntactic interpretation of “..” to produce the URL
http://web.pedantic.edu/phone.html
another non-existent file. Again, if the user had started with URL 2, the result of syntactic interpretation of “..” would be to request the file
http://web.pedantic.edu/alice/phone.html
as originally intended.
This problem could be fixed in at least three different ways:
1. Arrange things so that the default context reference always works.
a. Always install a UNIX link to the referenced page in the directory that held the referring page. (Or never use UNIX links at all.)
2. Do a better job of choosing a default context reference.
One might suggest that the implementer of the server (or the writer of the pages containing the relative links) failed to heed the following warning in the Web URL specification* for path names: “The similarity to unix and other disk operating system filename conventions should be taken as purely coincidental, and should not be taken to indicate that URIs should be interpreted as file names.”
This warning is technically correct, but the suggestion is misleading. Unfortunately, the problem is built into the Web naming specifications. Those specifications require that relative names be interpreted syntactically, yet they do not require that every object have a unique URL. Unambiguous syntactic interpretation of relative names requires that the context reference be a unique path name. Since the browser derives the context reference from the path name of the object that contained the relative name, and that object’s path name does not have to be unique, it follows that syntactic interpretation of relative names will intrinsically be ambiguous. When servers try to map URL path names to UNIX path names, which are not unique, they are better characterized as exposing, rather than causing, the problem.
That analysis suggests that one way to conquer the problem is to change the way in which the browser acquires the context reference. If the browser could somehow obtain a canonical path name for the context reference, the same canonical path name that the UNIX system uses to reach the directory from the root, the problem would vanish.
Occasionally, one will encounter a URL that looks something like
http://www.amazon.com/exec/obidos/ASIN/0670672262/o/qid=921285151/sr=2-2/002-7663546-7016232
or perhaps
http://www.google.com/search?hl=en&q=books+about+systems&btnG=Google+Search&aq=f&oq=
Here we have two splendid examples of overloading of names. The first example is of a shopping service. Because the server cannot depend on the client to maintain any state about this shopping session other than the URL of the Web page currently being displayed, the server has encoded the state of the shopping session, in the form of an identifier of a state-maintaining file at the server, in the path name part of the URL.
The second example is of a search service; the browser has encoded the user’s search query into the path name part of the URL it has submitted. The tip-off here is the question mark in the middle of the name, which is a syntactic trick to alert the server that the string up to the question mark is the name of a program to be launched, while the string after the question mark is an argument to be given to that program. To see what processing www.google.com does to respond to such a query, see Suggestions for Further Reading 3.2.4.
There is another form of overloading in many URLs: they concatenate the name of a computer site with a path name of a file, neither of which is a particularly stable identifier. Consider the following name for an earthquake information service:
http://HPserver14.pedantic.edu/disk05/science/geophysics/quakes.html
This name is at risk of change if the HP computer is replaced by a Sun server, if the file server is moved to disk04, if the geophysics department is renamed “geology” or moves out of the school of science, or if the responsibility for the earthquake server moves to the Institute for Scholarly Studies. A URL such as this example frequently turns out to be unresolvable, even though the page it originally pointed to is still out there somewhere, perhaps having moved to a different site or simply to a different directory at the original site.
One way to avoid trapping the name of a site in the URLs that point to it is to choose a service name and arrange for DNS to bind that service name as an indirect name for the site. Then, if it becomes necessary to move the Web site to a different computer, a change to the binding of the service name is all that is needed for the old URLs to continue working. Similarly, one can avoid trapping an overloaded path name in a URL by judicious use of indirect file names. Thus the name
http://quake.org/library/quakes.html
could refer to the same Web page, yet it can remain stable through a wide variety of changes.
Considerable intellectual energy has been devoted to inventing a replacement for the URL that has less overloading and is thus more robust in the face of changes of server site and file system structure. Several systems have been proposed: the Permanent URL (PURL), the Universal Resource Name (URN), the Digital Object Identifier (DOI)®, and the handle. To date, none of these proposals has achieved wide enough adoption to replace the URL.
Although designing a naming scheme seems to be a straightforward exercise, it is surprisingly difficult to meet all of the necessary requirements simultaneously. The following are several examples of strange and sometimes surprising results that have been noticed in deployed naming schemes.
A west coast university provides a “visual class list” Web interface that instructors can use to obtain the names and photos of all the students enrolled in a particular section of a class. At the beginning of the fall 2004 teaching term, instructors noticed that their classes had several photographs of the same individual. One might believe a section includes a set of triplets, but not triskaidekatuplets.
What went wrong: When there is no picture available for a student, the system inserts an image of a smiley face with the words “No picture available”. The system designer stored the image in a file named “smiley.jpg”. That fall a new freshman whose last name was Smiley registered the user name “smiley”. As one might expect, the freshman’s photograph was named “smiley.jpg”, and it became the “No picture available” image.
Internet mailbox names such as Alice@Awesome.net can be viewed as two-component addresses. The component before the @-sign identifies a particular mailbox, and the component after the @-sign is an Internet domain name that identifies the Internet service provider (ISP) that provides that mailbox. When two ISPs (say, Awesome.net and Awful.net) merge, the customers of one of them (and sometimes both) typically receive a letter telling them that their mailbox address, which contained some representation of the name of their former ISP, will have to change. The new ISP may automatically forward mail addressed to the old address, or it may require that the user notify all of his or her correspondents of the new mailbox address. The reason for the change is that the second component of the old mailbox name was overloaded with a trademark. The new provider does not want to continue using that old trademark, and the old provider may not want to see the trademark used by the new provider.
Alice may also find, to her disappointment, that not only does the domain name of her mailbox change from Awesome.net to Awful.net, but that in Awful.net’s mailbox name space, another customer has already captured the personal mailbox name Alice, so she may even have to choose a new personal mailbox name, such as Alice24.
As the Internet has grown, some ISPs have prospered and others have not, so there have been many mergers and buyouts. The resulting fragility of e-mail service provider names has created a market for indirect domain names. The customers in this market are users who require a stable e-mail address, such as people who run private businesses or who have a large number of correspondents. For an annual fee, an indirect name provider will register a new domain name, such as Alice.com, and configure a DNS name server so that the mailbox name Alice@Alice.com becomes a synonym for Alice@Awesome.net. Then, upon being notified of the ISP merger, Alice simply asks the indirect name provider to rebind the mailbox name Alice@Alice.com to Alice24@Awful.net, and her correspondents don’t have to know that anything happened.
The United States Post Office assigns postal delivery codes, called Zip codes, hierarchically, so that it can take advantage of the hierarchy in routing mail. Zip codes have five digits. The first digit identifies one of 10 national areas; New England is area 0, and California, Washington, and Oregon comprise area 9. The next two digits identify a section. The South Station Postal Annex in Boston, Massachusetts, is the headquarters of section 021. All Zip codes beginning with those three digits have their mail sorted at that sectional center. Zip codes beginning with 024 identify the Waltham, Massachusetts, section. The last two digits of the Zip code identify a specific post office (known as a station), such as Waban, Massachusetts, 02468. Zip codes can also have four appended digits (called Zip + 4) that are used to sort mail into delivery order for each mail carrier. Although they are numerical, adjacent Zip codes are not necessarily assigned to adjacent stations or adjacent sections, so they are really names rather than physical addresses. Despite not being interpretable as physical addresses, these names are overloaded with routing information.
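A sketch of the digit fields just described, assuming a plain five-digit code; zip_fields is a hypothetical helper for illustration:

    def zip_fields(zip_code):
        # Split a five-digit Zip code into its hierarchical fields.
        return {"area": zip_code[0],        # one of 10 national areas
                "section": zip_code[:3],    # sectional center, e.g. 021 or 024
                "station": zip_code[3:5]}   # post office within the section

    print(zip_fields("02468"))  # Waban, Massachusetts
    # -> {'area': '0', 'section': '024', 'station': '68'}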
Although routing is hierarchical, apparently the 10 national areas have no routing significance; everything is done by section. It is reported that if you walk into the South Station Postal Annex in Boston, you will find that outgoing mail is being sorted into 999 bins, one for each sectional center, nationwide. In addition, for mail addressed with Zip codes beginning with 021 (that is, within the South Station section) there are 99 bins, one for each station within the section. The mail in the outgoing bins goes into bags, with each bag containing mail for one section. Then all the bags for Southern California sections, for example, go into the same truck to the airport, where they go onto a plane to Los Angeles. As they come off the plane in Los Angeles, they are loaded onto different trucks that go to the various Southern California sections. The mail in the 99 bins for section 021 also goes into bags, with each bag destined for a different post office within the 021 section.
Mail that originates at a post office and is destined for the same post office still goes to the sectional center for sorting because individual post offices don’t have the automatic sorting machines that can put things into delivery order. There used to be many exceptions to the rule that all mail goes to a sectional center, but the number of exceptions has been gradually reduced over the years.
When the volume of mail handled by the South Station Postal Annex began to exceed its capacity in the late 1990s, the Post Office decided to transfer part of that section’s work to the newer Waltham, Massachusetts, section. Since the first three digits of the Zip code are overloaded with routing information, to accomplish this change it announced that about half of the Zip codes that began with 021 would, on July 1, 1998, change to 024. The result, as one might expect, was rather chaotic. The Post Office tried to work with large mailers to have them automatically update their address records, but loose ends soon appeared.
For example, American Express, a credit card company, installed a Zip code translator in its mail label printing system, so that its billing statements would go directly to the Waltham section, but it did not change its internal customer address records because its computer system flags all address changes as “moves”, which affect verification procedures as well as credit ratings. So everything that American Express mailed was addressed properly, but their internal records retained the old Zip codes.
Now comes the problem: some Internet vendors will not accept a credit card unless the shipping address is identical to the credit card address. Customers began to encounter situations in which the Internet vendor rejected the Zip code 02168 as being an invalid delivery address, and American Express rejected the Zip code 02468 because it did not match its customer record. When this situation arose, it was not possible to complete a purchase without human intervention.
Despite the vendor check that identifies 02168 as invalid, mail addressed with that Zip code continued to be correctly delivered to addresses in Waban for several years. It just took an extra day to be delivered because it went first to the South Station Postal Annex, which simply forwarded it to the Waltham sectional center. The renaming was done not because the post office was running out of Zip codes, but rather because the sorting capacity of one of its sectional centers was exceeded.
Even though, as described on page 128, the UNIX system propagated case-sensitive file names to many other file systems, not all widely used naming schemes are case-sensitive. The Internet generally is case-preserving. For example, in the Internet Domain Name System described in Section 4.4, one can open a network connection to cse.pedantic.edu or to CSE.Pedantic.edu; both refer to the same destination. The Internet mail system is also specified to be case-preserving, so you can send mail to alice@pedantic.edu, Alice@pedantic.edu, and aLiCe@pedantic.edu, and all three messages should go to the same mailbox.
In contrast, the Kerberos authentication system (described in Sidebar 11.6 [on-line]) is case-sensitive, so the names “alice” and “Alice” can identify different users. The rationale for this decision is muddy. Requiring that the case match exactly makes it harder for an intruder to guess a user’s name, so one can argue that this decision enhances security. But allowing “alice” and “Alice” to identify different users can lead to serious mistakes in setting up permissions, so one can also argue that this decision weakens security. This decision comes to a head, for example, in the implementation of a mail delivery service with Kerberos authentication. It is not possible to correctly do a direct mapping of Kerberos user names to mailbox names because the necessary case coercion might merge the identities of two distinct users.
A mixed example is a service-naming service developed at M.I.T. and called Hesiod, which uses the Internet Domain Name System (DNS) as a subsystem. One of the kinds of services Hesiod can name is a remote file system. DNS (and thus Hesiod) is case-insensitive, while file system names in UNIX systems are case-sensitive. This difference leads to another example of a user interface glitch. If a user asks to attach a remote file system, specifying its Hesiod name, Hesiod will locate the file system using whatever case the user typed, but the UNIX mount command mounts the file system using the name coerced to lower case. Thus if the user says, for example, to mount the remote file system named CSE, Hesiod will locate that remote file system, but the UNIX system will mount it using the name cse. To use this directory in a file name, the user must then type its name in lower case, which may come as a surprise.
Hesiod is used as a subsystem in larger systems, so the mixing of case-sensitive and case-insensitive names can become worse. For example, the current official M.I.T. Web server responds to the URL
http://web.mit.edu/Alice/www/home.html
by first trying a simple path name resolution of the string /Alice/www/home.html. If it gets a NOT-FOUND result from the resolution of that path name, it extracts the first component of the path name (Alice) and presents it to the Hesiod service naming system, with a request to interpret it as a remote file system name. Since Hesiod is case-insensitive, it doesn’t matter whether the presented name is Alice, alice, or aLiCe. Whatever the case of the name presented, Hesiod coerces it to a standard case, and then it returns the standard file system path name of the corresponding remote file system directory, which for this example might be
/afs/athena/user/alice
The Web server then replaces the original first component (Alice) with this path name and attempts to resolve the path name:
/afs/athena/user/alice/www/home.html
Thus for the current M.I.T. Web server, the first component name after the host name in a URL is case-insensitive, while the rest of the name is case-sensitive.
“Nynex is Proposing New ‘646’ Area Code for Manhattan Lines”
— headline, Wall Street Journal, March 3, 1997
The North American telephone numbering plan name space is nicely hierarchical, which would seem to make it easy to add phone numbers. Although this appears to be an example of an unlimited name space, it is not. It is hierarchical, but the hierarchy is rigid—there is a fixed number of levels, and each level has a fixed size.
Much of Europe does it the other way. In some countries it seems that every phone number has a different number of digits. A variable-length numbering plan has a downside: the telephone numbers are longest in the places that grew the most and thus have the most telephone calls. In addition, because the central exchange can’t find the end of a variable-length telephone number by counting digits, some other scheme is necessary, such as noticing that the user has stopped dialing for a while.
A European-style solution to the shortage of phone numbers in Manhattan would be to simply announce that from now on, all numbers in Manhattan will be 11 digits long. But since the entire American telephone system assumes that telephone numbers are exactly 10 digits long, the American solution is to introduce a new area code.
A new area code can be introduced in one of two ways: splitting and overlay. Traditionally, the phone companies have used only splitting, but overlay is beginning to receive wider attention.
Splitting (sometimes called partition) is done by drawing a geographical line across the middle of the old area code—say, 84th Street in Manhattan—and declaring that everyone north of that line is now in code 646 and everyone south of that line will remain in code 212. When splitting is used, no one “changes” their seven-digit number, but many people must learn a new number when calling someone else. For example,
Callers from Los Angeles who used to dial (212) xxx-xxxx must now dial (646) xxx-xxxx if they are calling a phone north of 84th Street, but they must use the old area code for phones located south of 84th Street.
Calling from one side of 84th Street to the other now requires adding an area code, where previously a seven-digit number was all one had to dial.
In the alternative scheme, overlay, area code 212 would continue to cover all of Manhattan, but when there weren’t any phone numbers left in that area code, the telephone companies would simply start assigning new numbers with area code 646. Overlay places a burden on the switching system, and it wouldn’t have worked with the step-by-step switches developed in the 1920s, in which the telephone number described the route to that telephone. When the Bell System started to design the crossbar switches introduced in the 1940s, it realized that this inflexibility was a killer problem, so it introduced a number-to-route lookup—a name-resolving system called a translator—as part of the switch design. With modern computer-based switches, translation is easy. So there is now nothing (but old software) to prevent two phones served by the same switch from having numbers with different area codes.
Overlay sounds like a great idea because it means that callers from Los Angeles continue to dial the same numbers they have always dialed. However, as in most engineering trade-offs, someone loses. Everyone in Manhattan now has to dial a 10-digit number to reach other places in Manhattan. One no longer can tell what the area code is by the geographic location of the phone. One also can’t pinpoint the location of the target by its area code because the area code has lost its status as geographic metadata. This could be a concern if people become confused as to whether or not they are making a toll call.
Another possibility would be to use as a default context the area code of the originating phone. If calling from a 212 phone, one wouldn’t have to dial an area code to call another 212 number. The prevailing opinion—which may be wrong—is that people can’t handle the resulting confusion. Two phones on the same desk, or two adjacent pay phones, may have different area codes, and thus to call someone in the next office one might have to dial different numbers from the two phones.
Here is one way of coping: BankBoston (long since merged into larger banks) once arranged that the telephone number 788-5000 ring its customer service center from every area code in the state of Massachusetts. The nationwide toll-free number (800) 788-5000 also rang there. Although that arrangement did not completely eliminate name translation, it reduced it significantly and made the remaining name translation simple enough that people could actually remember how to do it.
Requiring that all numbers be dialed with all 10 digits encourages a more coherent model: the number you dial to reach a particular target phone does not depend on the number from which you are calling. The trade-off is that every North American number dialed would require 10 digits, even when calling the phone next door. The North American telephone system has been gradually moving in this direction for a long time. In many areas, it was once possible to call people in the same exchange simply by dialing just the last four digits of their number. Then it took five digits. Then seven. The jump to 10 would thus be another step in the sequence.
The newspaper also reports that at the rate telephone numbers are being used up in Manhattan, another area code will be needed within a few years. That observation would seem to affect the decision. Splitting is disruptive every time, but overlay is disruptive only the first time it is done. If another area code is going to be needed that soon, it might be better to use overlay at the earliest opportunity, since adding still more area codes with overlay will cause no disruption at all.
Overlay is already widely used. Manhattan cell phones and beepers have long used area code 917, and little confusion resulted. Also, in response to an outcry over “yet another number change”, in 1997 the Commonwealth of Massachusetts began requiring that future changes to its telephone numbering plan be done with overlay.
3.1 Alyssa asks you for some help in understanding how metadata is handled in the UNIX file system, as described in Section 2.5.
3.1a Where does the UNIX system store system metadata about a file?
3.1b Where does it store user metadata about the file?
3.1c Where does it store system metadata about a file system?
2008-0-1
3.2 Bob and Alice are using a UNIX file system as described in Section 2.5. The file system has two disks, mounted as /disk1 and /disk2. A system administrator creates a “home” directory containing symbolic links to the home directories of Bob and Alice via the commands:
Subsequently, Bob types the following to his shell:
and receives an error. Which of the following best explains the problem?
A. The UNIX file system forbids the use of “..” in a cd command when the current working directory contains a symbolic link.
B. Since Alice’s home directory now has two parents, the system complains that “..” is ambiguous in that directory.
C. In Alice’s home directory, “..” is a link to /disk1, while the directory “bob” is in /disk2.
D. Symbolic links to directories on other disks are not supported in the UNIX file system; their call-by-name semantics allows their creation but causes an error when they are used.
2007-1-7
3.3 We can label the path names in the previous question as semantic path names: if Bob types “cd ..” while in working directory d, the command changes the working directory to the directory in which d was created. To make the behavior of “..” more intuitive, Alice proposes that “..” should behave syntactically in path names. That is, the parent of a directory d, written d/.., is the same directory one would obtain by removing the last path name component of d. For example, if Bob’s current working directory is /a/b/c and Bob types “cd ..”, the result is exactly as if Bob had executed “cd /a/b”.
3.3a If the UNIX file system were to implement syntactic path names, in which directory would Bob end up after typing the following two commands?
3.3b Under what circumstances do semantic path names and syntactic path names provide the same behavior?
A. When the name space of the file system forms an undirected graph.
B. When the name space of the file system forms a tree rooted at “/”.
C. When there are no synonyms for directories.
D. When symbolic links, like hard links, can be used as synonyms only for files.
3.3c Bob proposes the following implementation of syntactic names. He will first rewrite a path name syntactically to eliminate the “..”, and then resolve the rewritten path name forward from the root. Compared to the implementation of semantic path names as described in Section 2.5, what is a disadvantage of this syntactic implementation?
A. The syntactic implementation may require many more disk accesses than for semantic path names.
B. The cost of the syntactic implementation scales linearly with the number of path name components.
C. The syntactic implementation doesn’t work correctly in the presence of hard links.
D. The syntactic implementation doesn’t resolve “.” correctly in the current working directory.
2007-0-1
3.4 The inode of a file plays an important role in the UNIX file system. Which of these statements is true of the inode data structure, as described in Section 2.5?
A. The inode of a file contains a reference count.
B. The reference count of the inode of a directory should not be larger than 1.
C. The inode of a directory contains the inodes of the files in the directory.
D. The inode of a symbolic link contains the inode number of the target of the link.
E. The inode of a directory contains the inode numbers of the files in the directory.
F. The inode number is a disk address.
G. A file’s inode is stored in the first 64 bytes of the file.
2005-1-4, 2006-1-1, and 2008-1-3
3.5 Section 3.3.1 describes a name collision problem. What could the designer of that system have done differently to eliminate (or reduce to a negligible probability) the possibility of this problem arising?
2008-0-2
Additional exercises relating to Chapter 3 can be found in the problem sets beginning on page 425.
* An ambitious attempt to design a naming architecture with all of these concepts wired into the hardware was undertaken by IBM in the 1970s, documented in a technical report by George Radin and Peter R. Schneider: An architecture for an extended machine with protected addressing, IBM Poughkeepsie Laboratory Technical Report TR 00.2757, May 1976. Although the architecture itself never made it to the market, some of the ideas later appeared in the IBM System/38 and AS/400 computer systems.
* Use of the word “overloading” to describe names that carry metadata is similar to, but distinct from, the use of the same word to describe symbols that stand for several different operators in a programming language.
* Actually, they could not have both visited Florence at the same time. The mathematician Leonardo di Pisa (also known as Fibonacci) lived three centuries before the artist Leonardo da Vinci.
* This case study informally introduces three message-related concepts that succeeding chapters will define more carefully: client (an entity that originates a request message); server (an entity that responds to a client’s request); and protocol (an agreement on what messages to send and how to interpret their contents). Chapter 4 expands on the client/service model, and Chapter 7 [on-line] expands the discussion of protocols.
* Tim Berners-Lee, Universal Resource Identifiers: Recommendations.
4.1 Client/Service Organization
4.1.1 From Soft Modularity to Enforced Modularity
4.1.2 Client/Service Organization
4.1.3 Multiple Clients and Services
4.1.4 Trusted Intermediaries
4.1.5 A Simple Example Service
4.2 Communication Between Client and Service
4.2.1 Remote Procedure Call (RPC)
4.2.2 RPCs are not Identical to Procedure Calls
4.2.3 Communicating Through an Intermediary
4.3 Summary and The Road Ahead
4.4 Case Study: The Internet Domain Name System (DNS)
4.4.1 Name Resolution in DNS
4.4.2 Hierarchical Name Management
4.4.3 Other Features of DNS
4.4.4 Name Discovery in DNS
4.4.5 Trustworthiness of DNS Responses
4.5 Case Study: The Network File System (NFS)
4.5.1 Naming Remote Files and Directories
4.5.2 The NFS Remote Procedure Calls
4.5.3 Extending the UNIX File System to Support NFS
4.5.4 Coherence
4.5.5 NFS Version 3 and Beyond
The previous chapters established that dividing a system into modules is good and showed how to connect modules using names. If all of the modules were correctly implemented, the job would be finished. In practice, however, programmers make errors, and without extra thought, errors in implementation may too easily propagate from one module to another. To avoid that problem, we need to strengthen the modularity. This chapter introduces a stronger form of modularity, called enforced modularity, that helps limit the propagation of errors from one module to another. In this chapter we focus on software modules. In Chapter 8 [on-line] we develop techniques to handle hardware modules.
One way to limit interactions between software modules is to organize systems as clients and services. In the client/service organization, modules interact only by sending messages. This organization has three main benefits:
Messages are the only way for a programmer to request that a module provide a service. Limiting interactions to messages makes it more difficult for programmers to violate the modularity conventions.
Messages are the only way for errors to propagate between modules. If clients and services fail independently and if the client and the service check messages, they may be able to limit the propagation of errors.
Messages are the only way for an attacker to penetrate a module. If clients and services carefully check the messages before they act on them, they can block attacks.
Because of these three benefits, system designers use the client/service organization as a starting point for building modular, fault tolerant, and secure systems.
Designers use the client/service model to separate larger software modules, rather than, say, individual procedures. For example, a database system might be organized as clients that send messages with queries to a service that implements a complete database management system. As another example, an e-mail application might be organized into readers—the clients—that collect e-mail from a service that stores mailboxes.
One effective way to implement the client/service model is to run each client and service module in its own computer and set up a communication path over a wire between the computers. If each module has its own computer, then if one computer (module) fails, the other computer (module) can continue to operate. Since the only communication path is that wire, that is also the only path by which errors can propagate.
Section 4.1 of this chapter shows how the client/service model can enforce modularity between modules. Section 4.2 presents two styles of sending and receiving messages: remote procedure call and publish/subscribe. Section 4.3 summarizes the major issues identified in this chapter but not addressed, and presents a road map for addressing them. Finally, there are detailed case studies of two widely used client/service applications, the Internet Domain Name System and the Network File System.
A standard way to create modularity in a large program is to divide it up into named procedures that call one another. Although the resulting structure can be called modular, implementation errors can propagate from caller to callee and vice versa, and not just through their specified interfaces. For example, if a programmer makes a mistake and introduces an infinite loop in a called procedure, and the procedure never returns, then the caller will never receive control again. Or, since the caller and callee are in the same address space and use the same stack, either one can accidentally store something in a space allocated to the other. For this reason, we identify this kind of modularity as soft. Soft modularity limits interactions of correctly implemented modules to their specified interfaces, but implementation errors can cause interactions that go outside the specified interfaces.
To enforce modularity, we need hard boundaries between modules so that errors cannot easily propagate from one module to another. Just as buildings have firewalls to contain fires within one section of the building and keep them from propagating to other sections, so we need an organization that limits the interaction between modules to their defined interfaces.
This section introduces the client/service organization as one approach to structuring systems that limit the interfaces through which errors can propagate to the specified messages. This organization has two benefits: first, errors can propagate only with messages. Second, clients can check for certain errors by just considering the messages. Although this approach doesn’t limit the propagation of all errors, it provides a sweeping simplification in terms of reasoning about the interactions between modules.
As a more concrete example of how modules interact, suppose we are writing a simple program that measures how long a function runs. We might want to split it into two modules: (1) one system module that provides an interface to obtain the time in units specified by the caller and (2) one application module that measures the running time of a function by asking for time from the clock device, running the function, and requesting the time from the clock device after the function completes. The purpose of this split is to separate the measurement program from the details of the clock device:
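The book's pseudocode for the two procedures is not reproduced in this excerpt. As a rough stand-in, here is a minimal sketch in Python; read_clock_register, the unit convention, and the tick constant are assumptions for illustration only:

    import time

    CLOCK_TICKS_PER_SECOND = 1_000_000   # assumed granularity of the clock

    def read_clock_register():
        # Stand-in for reading the clock device at physical address CLOCK.
        return time.monotonic_ns() // 1_000

    def get_time(unit):
        # Return the current time in caller-specified units per second,
        # hiding the clock's address and granularity from every caller.
        return read_clock_register() * unit // CLOCK_TICKS_PER_SECOND

    def measure(func):
        # Time a function by reading the clock before and after it runs.
        start = get_time(unit=1000)      # milliseconds
        func()
        return get_time(unit=1000) - start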
The procedure MEASURE takes a function func as an argument and measures its running time. The procedure GET_TIME returns the time measured in the units specified by the caller. We may desire this clear separation in modules because, for example, we don’t want every function that needs the time to know the physical address of the clock (CLOCK in line 2 of GET_TIME) in all application programs, such as MEASURE, that use the clock. On one computer, the clock is at physical address 17E5 hex, but on the next computer it is at 24FFF2 hex. Or some clocks return microseconds, and others return sixtieths of a second. By putting the clock’s specific properties into GET_TIME, the callers of GET_TIME do not have to be changed when a program is moved to another computer; only GET_TIME must be changed.
This boundary between GET_TIME and its caller is soft, however. Although procedure call is a primary tool for modularity, errors can still leak too easily from one module to another. It is obvious that if GET_TIME returns a wrong answer, the caller has a problem. It is less obvious that programming errors in GET_TIME can cause trouble for the caller even if GET_TIME returns a correct answer. This section explains why procedure call allows propagation of a wide variety of errors and introduces an alternative that resembles procedure call but that more strongly limits the propagation of errors.
要了解过程调用为何允许传播多种错误,必须查看过程调用的工作方式以及实现过程调用的处理器指令。有很多方法可以将过程和从MEASURE到GET_TIME 的调用编译为处理器指令。为了具体起见,我们选择一种过程调用约定。其他约定在细节上有所不同,但都表现出我们想要探究的相同问题。
To see why procedure calls allow propagation of many kinds of errors, one must look at the detail of how procedure calls work and at the processor instructions that implement procedure calls. There are many ways to compile the procedures and the call from MEASURE to GET_TIME into processor instructions. For concreteness we pick one procedure call convention. Others differ in the details but exhibit the same issues that we want to explore.
我们用堆栈来实现对GET_TIME的调用,这样GET_TIME就可以调用其他过程(尽管在本例中它没有这样做)。通常,被调用的过程可以调用另一个过程,甚至可以递归调用自身。为了允许调用其他过程,实现必须遵守堆栈规则:每次调用过程时,堆栈必须保持原样。
We implement the call to GET_TIME with a stack, so that GET_TIME could call other procedures (although in this example it does not do so). In general, a called procedure may call another procedure or even call itself recursively. To allow for calls to other procedures, the implementation must adhere to the stack discipline: each invocation of a procedure must leave the stack as it found it.
To adhere to this discipline, there must be a convention for who saves what registers, who puts the arguments on the stack, who removes them, and who allocates space on the stack for temporary variables. The particular convention used by a system is called the procedure calling convention. We use the convention shown in Figure 4.1. Each procedure call results in a new stack frame, which has space for saved registers, the arguments for the callee, the address where the callee should return, and local variables of the callee.
Figure 4.1 Procedure call convention.
Given this calling convention, the processor instructions for these two modules are shown in Figure 4.2. In this example, the instructions of the caller (MEASURE) start at address 100, and the instructions of the callee (GET_TIME) start at address 200. The stack grows up, from a low address to a high address. The return value of a procedure is passed through register R0. For simplicity, assume that instructions, memory locations, and addresses are all 4 bytes wide. For our example, MEASURE invokes GET_TIME as follows:
Figure 4.2 The procedure MEASURE (located at address 100) calls GET_TIME (located at address 200).
1. The caller saves the contents of the temporary registers (R1 and R2) at addresses 100 through 112.
2. The caller stores the arguments on the stack (addresses 116 through 124) so that the callee can find them. (GET_TIME takes one argument: unit.)
3. The caller stores a return address on the stack (addresses 128 through 136) so that the callee can know where the caller should resume execution. (The return address is 148.)
4. The caller transfers control to the callee by jumping to the address of its first instruction (addresses 140 and 144). (The callee, GET_TIME, is located at address 200.) The stack for our example now looks as in the following figure.
5. The callee loads its argument from the stack into R2 (addresses 200 through 208).
6. The callee computes with the arguments, perhaps calling other functions (address 212).
7. The callee loads the return value of GET_TIME into R0, the register the implementation reserves for returning values (address 220).
8. The callee loads the return address from the stack into the PC (addresses 224 through 232), which causes the caller to resume control at address 148.
9. The caller adjusts the stack (address 148).
10. The caller restores the contents of R1 and R2 (addresses 152 through 164).
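By way of illustration only, the following toy model (ours, not the book’s; Python’s own call mechanism quietly does the real work) replays steps 1 through 10 with an explicit stack and checks that the callee observed the stack discipline:

    stack = []                                # grows by append, toward higher "addresses"
    registers = {"R0": 0, "R1": 11, "R2": 22}

    def call_get_time(unit):
        depth = len(stack)                    # the stack as the caller set it up
        stack.append(registers["R1"])         # step 1: save temporary registers
        stack.append(registers["R2"])
        stack.append(unit)                    # step 2: push the argument
        stack.append("return to 148")         # step 3: push the return address
        get_time_body()                       # step 4: "jump" to the callee
        stack.pop()                           # step 9: remove the return address...
        stack.pop()                           # ...and the argument from the stack
        registers["R2"] = stack.pop()         # step 10: restore the temporaries
        registers["R1"] = stack.pop()
        assert len(stack) == depth, "callee violated the stack discipline"
        return registers["R0"]                # the return value, by convention in R0

    def get_time_body():
        unit = stack[-2]                      # step 5: load the argument from the stack
        registers["R0"] = 42 * unit           # steps 6-7: compute; result goes in R0
        # step 8: returning from this Python call stands in for loading the PC

    print(call_get_time(1))                   # prints 42

Note how little keeps the callee honest here: nothing prevents get_time_body from popping or overwriting the caller’s entries, which is exactly the softness examined next.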
We use the low-level instructions of the processor for the specific example in Figure 4.2 because they expose the fine print of the contract between the caller and the callee and show how errors can propagate. In the MEASURE example, the contract specifies that the callee returns the current time in some agreed-upon representation to the caller. If we look under the covers, however, we see that this functional specification is not the full contract and that the contract doesn’t have a good way of limiting the propagation of errors. To uncover the fine print of the contract between modules, we need to inspect how the stack from Figure 4.2 is used to transfer control from one module to another. The contract between caller and callee contains several subtle potential problems:
By contract, the caller and callee modify only shared arguments and their own variables in the stack. The callee leaves the stack pointer and the stack the way the caller has set it up. If there is a problem in the callee that corrupts the caller’s area of the stack, then the caller might later compute incorrect results or fail.
By contract, the callee returns where the caller told it to. If by mistake the callee returns somewhere else, then the caller probably performs an incorrect computation or loses control completely and fails.
By contract, the callee stores return values in register R0. If by mistake the callee stores the return value somewhere else, then the caller will read whatever value is in register R0 and probably perform an incorrect computation.
By contract, the caller saves the values in the temporary registers (R1, R2, etc.) on the stack before the call to the callee and restores them when it receives control back. If the caller doesn’t follow the contract, the callee may have changed the content of the temporary registers when the caller receives control back, and the caller probably performs an incorrect computation.
Disasters in the callee can have side effects in the caller. For example, if the callee divides by zero and, as a result, terminates, the caller may terminate too. This effect is known colloquially as fate sharing.
If the caller and callee share global variables, then by contract, the caller and callee modify only those global variables that are shared between them. Again, if the caller or callee modifies some other global variable, they (or other modules) might compute incorrectly or fail altogether.
Thus, the procedure call contract provides us with what might be labeled soft modularity. If a programmer makes an error or there is an error in the implementation of the procedure call convention, these errors can easily propagate from the callee to the caller. Soft modularity is usually attained through specifications, but nothing forces the interactions among modules to conform to their defined interfaces. If the callee doesn’t adhere (intentionally or unintentionally) to the contract, the caller has a serious problem. We have modularity that is not enforced.
There are also other possibilities for propagation of errors. The procedures share the same address space, and, if a defective procedure incorrectly smashes a global variable, even a procedure that did not call the defective one may be affected. Any procedure that doesn’t adhere, either intentionally or unintentionally, to the contract may cause trouble for other modules.
Using a constrained and type-safe implementation language such as Java can beef up soft modularity to a certain extent (see Sidebar 4.1) but is insufficient for complete systems. For one, it is uncommon for all modules in a system to be implemented in a type-safe language. Often, for performance reasons, some modules of a system are written in a programming language that doesn’t enforce modularity, such as C, C++, or processor instructions. But even if the whole system is developed in a type-safe language like Java, we need stronger modularity. If any of the Java modules raises an error (because the interpreter raises a type violation, the module allocated more memory than available, the module couldn’t open a file, etc.) or has a programming error (e.g., an infinite loop), we would like to ensure that other modules don’t immediately fail too. Even if a called procedure doesn’t return, we would like to ensure that the caller has a controlled problem.
Sidebar 4.1 Enforcing Modularity with a High-Level Language
A high-level language is helpful in enforcing modularity because its compiler and runtime system perform all stack and register manipulation, presumably accurately and in accordance with the procedure calling convention. Furthermore, if the programming language enforces a restriction that programs write only to memory locations that correspond to variables managed by the language and in accordance with their type, then programs cannot overwrite arbitrary memory locations and, for example, corrupt the stack. That is, a program cannot use the value of a variable of type integer as an address of a memory location and then store to that memory location. Such languages are called strongly typed and, if a program cannot avoid the type system in any way, type safe. Modern examples of strongly typed languages include Java and C#.
But even with strongly typed languages, modularity through procedure calls doesn’t limit the interactions between modules to their defined interfaces. For example, if the callee has a programming error and allocates all of the available memory space, then the caller may be unable to proceed. Also, strongly typed languages allow the programmer to escape the type system of the language to obtain direct access to memory or to processor registers and to exercise system features that the language does not support (e.g., reading and writing memory locations that correspond to the control registers and state of a device). But this access opens a path for the programmer to make mistakes that violate the procedure call contract.
Another concern is that in many computer systems different modules are written in different programming languages, perhaps because an existing, older module is being reused, even though its implementation language does not provide the type-safety features, or because a lower-level language fragment is essential for achieving maximum performance. Even when the caller and callee are written in two different, strongly typed languages, unexpected interactions can occur at their interface because their conventions do not match.
Another source of errors, which in practice seems to occur much less often, is an implementation error in the interpreter of the application (though with increasing complexity of compilers, runtime support systems, and processor designs, this source may yet become significant). The compiler may have a programming error, the runtime support system may have set up the stack incorrectly, the processor or operating system may save and restore registers incorrectly on an interrupt, a memory error may cause a LOAD instruction to return an incorrect value, and so on. Although these sources are less likely to occur than programming errors, it is good to contain the resulting errors so that they don’t propagate to other modules.
For all these reasons, designers use the client/service organization. Combining the client/service organization with writing a system in a strongly typed language offers additional opportunities for enforcing modularity; see, for example, the design of the Singularity operating system [Suggestions for Further Reading 5.2.3].
What we desire in systems is enforced modularity: modularity that is enforced by some external mechanism. This external mechanism limits the interaction among modules to the ones we desire. Such a limit on interactions reduces the number of opportunities for propagation of errors. It also allows verification that a user uses a module correctly, and it helps prevent an attacker from penetrating the security of a module.
One good way to enforce modularity is to limit the interactions among modules to explicit messages. It is convenient to impose some structure on this organization by identifying participants in a communication as clients or services.
Figure 4.3 shows a common interaction between client and service. The client is the module that initiates a request: it builds a message containing all the data necessary for the service to carry out its job and sends it to a service. The service is the module that responds: it extracts the arguments from the request message, executes the requested operations, builds a response message, sends the response message back to the client, and waits for the next request. The client extracts the results from the response message. For convenience, the message from the client to the service is called the request, and the message from the service back to the client is called the response or reply.
Figure 4.3 Communication between client and service.
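To make the pattern concrete, here is a minimal sketch in Python (our example; the one-word ADD1 operation and the fixed 1024-byte receive are arbitrary choices for illustration):

    import socket, threading

    def service(sock):
        request = sock.recv(1024)                     # wait for a request message
        op, arg = request.decode().split()            # extract operation and argument
        if op == "ADD1":                              # execute the requested operation
            sock.sendall(str(int(arg) + 1).encode())  # build and send the response

    client_end, service_end = socket.socketpair()
    threading.Thread(target=service, args=(service_end,)).start()

    client_end.sendall(b"ADD1 41")                    # client builds and sends the request
    print(client_end.recv(1024).decode())             # client extracts the result: 42

The only thing that crosses the boundary is the explicit message; neither side can reach into the other’s variables.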
Figure 4.3 shows one common way in which a client and a service interact: a request is always followed by a response. Since a client and a service can interact using many other sequences of messages, designers often represent the interactions using message timing diagrams (see Sidebar 4.2). Figure 4.3 is an instance of a simple timing diagram.
Sidebar 4.2 Representation: Timing Diagrams
A timing diagram is a convenient representation of the interaction between modules. When the system is organized in a client/service style, this representation is particularly convenient because the interactions between modules are limited to messages. In a timing diagram, the lifetime of a module is represented by a vertical line, with time increasing down the vertical axis. The following example illustrates the use of a timing diagram for a sewage pumping system. The label at the top of a timeline names the module (pump controller, sensor service, and pump service). The physical separation between modules is represented horizontally. Since it takes time for a message to get from one point to another, a message going from the pump controller to the pump service is represented by an arrow that slopes downward to the right.
The modules perform actions and send and receive messages. The labels next to a timeline indicate actions taken by the module at a certain time. Modules can take actions at the same time, for example, if they are running on different processors.
The arrows indicate messages. The start of the arrow indicates the time the message is sent by the sending module, and the point of an arrow indicates the time the message is received at the destination module. The content of a message is described by the label associated with the arrow. In some examples, messages can be reordered (arrows cross) or lost (arrows terminate midflight before reaching a module).
The simple timing diagram shown in this sidebar describes the interaction between a pump controller and two services: a sensor service and a pump service. There is a request containing the message “measure tank level” from the client to the sensor service, and a response reports the level read by the sensor. There is a third message, “start pump”, which the client sends to the pump service when the level is too high. The second message has no response. The diagram shows three actions: reading the sensor, deciding whether the pump must be started, and starting the pump. Figure 7.7 [on-line] shows a timing diagram with a lost message, and Figure 7.9 [on-line] shows one with a delayed message.
Conceptually, the client/service model runs clients and services on separate computers, connected by a wire. This implementation also allows client and service to be separated geographically (which can be good because it reduces the risk that both fail owing to a common fault such as a power outage) and restricts all interactions to well-defined messages sent across a wire.
The disadvantage of this implementation is that it requires one computer per module, which may be costly in equipment. It may also have a performance cost because it may take a substantial amount of time to send a message from one computer to another, in particular if the computers are far apart geographically. In some cases these disadvantages are unimportant; for cases in which they do matter, Chapter 5 will explain how to implement the client/service model within a single computer using an operating system. For the rest of this chapter we will assume that the client and the service each have their own computer.
To achieve high availability or handle big workloads, a designer may choose to implement a service using multiple computers. For instance, a file service might use several computers to achieve a high degree of fault tolerance; if one computer fails, another one can take over its role. An instance of a service running on a single computer is called a server.
To make the client/service model more concrete, let’s rearrange our MEASURE program into a simple client/service organization (see Figure 4.4). To get a time from the service, the client procedure builds a request message that names the service and specifies the requested operation and arguments (lines 2 and 6). The requested operation and arguments must be converted to a representation that is suitable for transmission. For example, the client computer may be a big-endian computer (see Sidebar 4.3), while the service computer may be a little-endian computer. Thus, the client must convert arguments into a canonical representation so that the service can interpret the arguments.
Figure 4.4 Example client/service application: time service.
Sidebar 4.3 Representation: Big-Endian or Little-Endian?
Two common conventions exist for numbering bits within a byte, bytes within a word, words within a page, and the like. One convention is called big-endian, and the other little-endian. In big-endian the most significant bit, byte, or word is numbered 0, and the significance of bits decreases as the address of the bit increases:
In big-endian the hex number ABCDhex would be stored in memory, so that if you read from memory in increasing memory address order, you would see A-B-C-D. The string “john” would be stored in memory as john.
In little-endian, the other convention, the least significant bit, byte, or word is numbered 0, and the significance of bits increases as the address of the bit increases:
In little-endian, the hex number ABCDhex would be stored in memory so that if you read from memory in increasing memory address order, you see D-C-B-A. The string “john” would still be stored in memory as john. Thus, code that extracts bytes from character strings is portable between architectures, but code that extracts bytes from integers is not.
Some processors, such as the Intel x86 family, use the little-endian convention, but others, such as the IBM PowerPC family, use the big-endian convention. As Danny Cohen pointed out in a frequently cited article “On holy wars and a plea for peace” [Suggestions for Further Reading 7.2.4], it doesn’t matter which convention a designer uses as long as it is the same one when communicating between two processors. The processors must agree on the convention for numbering the bits sent over the wire (that is, send the most significant bit first or send the least significant bit first). Thus, if the communication standard is big-endian (as it is in the Internet protocols), then a client running on a little-endian processor must marshal data in big-endian order. This book uses the big-endian convention.
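For example, Python’s struct module can exhibit both conventions for the same 32-bit value (“>” selects big-endian, “<” little-endian, and “!” network order, which is big-endian):

    import struct

    value = 0xABCD
    print(struct.pack(">I", value).hex())   # big-endian bytes:    0000abcd
    print(struct.pack("<I", value).hex())   # little-endian bytes: cdab0000
    print(b"john")                          # byte strings are unaffected by byte order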
This book also follows the convention that bit numbers start with zero. This choice is independent of the big-endian convention; we could have chosen to use 1 instead, as some processors do.
This conversion is called marshaling. We use the notation {a, b} to denote a marshaled message that contains the fields a and b. Marshaling typically involves converting an object into an array of bytes with enough annotation so that the unmarshal procedure can convert it back into a language object. In this example, we show the marshal and unmarshal operations explicitly (e.g., the procedure calls starting with CONVERT), but in many future examples these operations will be implicit to avoid clutter.
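A minimal sketch of such a pair of operations (ours, in Python), assuming fields are either byte strings or integers converted to a canonical big-endian representation, and assuming the receiver knows each field’s type (a real marshaler would add type annotations):

    import struct

    def marshal(*fields):
        # Build the message {a, b, ...}: prefix each field with its length
        # so that UNMARSHAL can split the byte array back into fields.
        message = b""
        for field in fields:
            data = field if isinstance(field, bytes) else struct.pack(">q", field)
            message += struct.pack(">I", len(data)) + data
        return message

    def unmarshal(message):
        fields, offset = [], 0
        while offset < len(message):
            (length,) = struct.unpack_from(">I", message, offset)
            fields.append(message[offset + 4 : offset + 4 + length])
            offset += 4 + length
        return fields

    request = marshal(b"GET_TIME", 1_000_000)   # the marshaled message {service, unit}
    print(unmarshal(request))                   # [b'GET_TIME', <8 big-endian bytes>]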
After constructing the request, the client sends it (lines 2 and 6), waits for a response (lines 3 and 7), and unmarshals the time (lines 4 and 8).
The service procedure waits for a request (line 12) and unmarshals the request (lines 13 and 14). Then, it checks the request (line 15), processes it (lines 16 through 19), and sends back a marshaled response (line 20).
The client/service organization not only separates functions (abstraction), but also enforces that separation (enforced modularity). Compared to modularity using procedure calls, the client/service organization has the following advantages:
The client and service don’t rely on shared state other than the messages. Therefore, errors can propagate from the client to the service, and vice versa, in only one way. If the services (as in line 15) and the clients check the validity of the request and response messages, then they can control the ways in which errors propagate. Since the client and service don’t rely on global, shared data structures such as a stack, a failure in the client cannot directly corrupt data in the service, and vice versa.
The transaction between a client and a service is an arm’s-length transaction. Many errors cannot propagate from one to the other. For instance, the client does not have to trust the service to return to the appropriate return address, as it does using procedure calls. As another example, arguments and results are marshaled and unmarshaled, allowing the client and service to check them.
The client can protect itself even against a service that fails to return because the client can put an upper limit on the time it waits for a response. As a result, if the service gets into an infinite loop, or fails and forgets about the request, the client can detect that something has gone wrong and undertake some recovery procedure, such as trying a different service (see the sketch after this list). On the other hand, setting timers can create new problems because it can be difficult to predict how long a wait is reasonable. The problem of setting timers for service requests is discussed in detail in Section 7.5.2 [on-line]. In our example, the client isn’t defensive against service errors; providing these defenses would make the program slightly more complex but can help eliminate fate sharing.
The client/service organization encourages explicit, well-defined interfaces. Because the client and service can interact only through messages, the messages that a service is willing to receive provide a well-defined interface for the service. If those messages are well specified and their specification is public, a programmer can implement a new client or service without having to understand the internals of another client or the service. Clear specification allows clients and services to be implemented by different programmers, and can encourage competition for the best implementation.
Separating state and passing well-defined messages reduce the number of potential interactions, which helps contain errors. If the programmer who developed the service introduces an error and the service has a disaster, the client has only a controlled problem. The client’s only concern is that the service didn’t deliver its part of the contract; apart from this wrong or missing value, the client has no concern for its own integrity. The client is less vulnerable to faults in the service, or, in slightly different words, fate sharing can be reduced. Clients can be mostly independent of service failures, and vice versa.
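A sketch of such a defensive client (ours; the list of candidate services and the two-second bound are assumptions for the example):

    import socket

    def request_with_timeout(services, request, timeout=2.0):
        # Bound the wait for a response so that a looping or crashed service
        # causes only a controlled problem; recover by trying the next service.
        for host, port in services:
            try:
                with socket.create_connection((host, port), timeout=timeout) as sock:
                    sock.settimeout(timeout)
                    sock.sendall(request)
                    return sock.recv(1024)        # the response, if it arrives in time
            except OSError:                       # timed out, refused, unreachable, ...
                continue
        raise RuntimeError("no service responded in time")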
The client/service organization is an example of a sweeping simplification because the model eliminates all forms of interaction other than messages. By separating the client and the service from each other using message passing, we have created a firewall between them. As with firewalls in buildings, if there is a fire in the service, it will be contained in the service, and, assuming the client can check for flames in the response, it will not propagate to the client. If the client and service are well implemented, then the only way to go from the client to the service and back is through well-defined messages.
Of course, the client/service organization is not a panacea. If a service returns an incorrect result, then the client has a problem. This client can check for certain problems (e.g., syntactic ones) but not all semantic errors. The client/service organization reduces fate sharing but doesn’t eliminate it. The degree to which the client/service organization reduces fate sharing is also dependent on the interface between the client and service. As an extreme example, if the client/service interface has a message that allows a client to write any value to any address in the service’s address space, then it is easy for errors to propagate from the client to the service. It is the job of the system designer to define a good interface between client and service so that errors cannot propagate easily. In this chapter and later chapters, we will see examples of good message interfaces.
For ease of understanding, most of the examples in this chapter exhibit modules consisting of a single procedure. In the real world, designers usually apply the client/service organization between software modules of a larger granularity. The tendency toward larger granularity arises because the procedures within an application typically need to be tightly coupled for some practical reason, such as they all operate on the same shared data structure. Placing every procedure in a separate client or service would make it difficult to manipulate the shared data. The designer thus faces a trade-off between ease of accessing the data that a module needs and ease of error propagation within a module. A designer makes this trade-off by deciding which data and procedures to group into a coherent unit with the data that they manipulate. That coherent unit then becomes a separate service, and errors are contained within the unit. The client and service units are often complete application programs or similarly large subsystems.
Another factor in whether or not to apply the client/service organization to two modules is the plan for recovery when the service module fails. For example, in a simulator program that uses a function to compute the square root of its argument, it makes little sense to put that function into a separate service because it doesn’t reduce fate sharing. If the square-root function fails, the simulator program cannot proceed. Furthermore, a good recovery plan is for the programmer to reimplement the function correctly, as opposed to running two square-root servers and failing over to the second one when the first one fails. In this example, the square-root function might as well be part of the simulator program because the client/service organization doesn’t reduce fate sharing for the simulator program and thus there is no reason to use it.
A nice example of a widely used system that is organized in a client/service style, with the client and service typically running on separate computers, is the World Wide Web. The Web browser is a client, and a Web site is a service. The browser and the site communicate through well-defined messages and are typically geographically separated. As long as the client and service check the validity of messages, a failure of a service results in a controlled problem for the browser, and vice versa. The World Wide Web provides enforced modularity.
In Figures 4.3 and 4.4, the service always responds with a reply, but that is not a requirement. Figure 4.5 shows the pseudocode for a pump controller for the sewage pumping system in Sidebar 4.2. In this example, there is no need for the pump service to send a reply acknowledging that the pump was turned off. What the client cares about is a confirmation from an independent sensor service that the level in the tank is going down. Waiting for a reply from the pump service, even for a short time, would just delay sounding the alarm if the pump failed.
Figure 4.5 Example client/service application: controller for a sewage pump.
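A sketch of the controller’s message pattern (ours, with hypothetical addresses and threshold; Figure 4.5 gives the book’s pseudocode): the request to the sensor service expects a response, while the message to the pump service does not:

    import socket

    SENSOR = ("sensor.example.org", 7001)    # hypothetical service addresses
    PUMP = ("pump.example.org", 7002)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)                     # bound the wait for the sensor's reply

    sock.sendto(b"measure tank level", SENSOR)   # request/response interaction
    level = int(sock.recv(64))                   # the sensor's response
    if level > 90:                               # hypothetical threshold
        sock.sendto(b"start pump", PUMP)         # no reply expected: don't wait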
Other systems avoid response messages for performance reasons. For example, the popular X Window System (see Sidebar 4.4) sends a series of requests that ask the service to draw something on a screen and that individually have no need for a response.
Sidebar 4.4 The X Window System
The X Window System [Suggestions for Further Reading 4.2.2] is the window system of choice on practically every engineering workstation and many personal computers. It provides a good example of using the client/service organization to achieve modularity. One of the main contributions of the X Window System is that it remedied a defect that had crept into the UNIX system when displays replaced typewriters: the display and keyboard were the only hardware-dependent parts of the UNIX application programming interface. The X Window System allowed display-oriented UNIX applications to be completely independent of the underlying hardware.
The X Window System achieved this property by separating the service program that manipulates the display device from the client programs that use the display. The service module provides an interface to manage windows, fonts, mouse cursors, and images. Clients can request services for these resources through high-level operations; for example, clients perform graphics operations in terms of lines, rectangles, curves, and the like. The advantage of this split is that the client programs are device independent. The addition of a new display type may require a new service implementation, but no application changes are required.
Another advantage of a client/service organization is that an application running on one machine can use the display on some other machine. This organization allows, for example, a computing-intensive program to run on a high-performance supercomputer, while displaying the results on a user’s personal computer.
It is important that the service be robust to client failures because otherwise a buggy client could cause the entire display to freeze. The X Window System achieves this property by having client and service communicate through carefully designed remote procedure calls, a mechanism described in Section 4.2. The remote procedure calls have the property that the service never has to trust the clients to provide correct data and that the service can process other client requests if it has to wait for a client.
The service allows clients to send multiple requests back to back without waiting for individual responses because the rate at which data can be displayed on a local display is often higher than the network data rate between a client and service. If the client had to wait for a response on each request, then the user-perceived performance would be unacceptable. For example, at 80 characters per request (one line of text on a typical display) and a 5-millisecond round-trip time between client and service, only 16,000 characters per second can be drawn, while typical hardware devices are capable of displaying an order of magnitude faster.
In the examples so far, we have seen one client and one service, but the client/service model is much more flexible:
One service can work for multiple clients. A printer service might work for many clients so that the cost of maintaining the printer can be shared. A file service might store files for many clients so that the information in the files can be shared.
One client can use several services, as in the sewage pump controller (see Figure 4.5), which uses both a pump service and a sensor service.
A single module can take on the roles of both client and service. A printer service might temporarily store documents on a file service until the printer is ready to print. In this case, the print service functions as a service for printing requests, but it is also a client of the file service.
A single service that has multiple clients brings up another technique for enforcing modularity: the trusted intermediary, a service that functions as the trusted third party among multiple, perhaps mutually suspicious, clients. The trusted intermediary can control shared resources in a careful manner. For example, a file service might store files for multiple clients, some of which are mutually suspicious; the clients, however, trust the service to keep their affairs distinct. The file service could ensure that a client cannot have access to files not owned by that client, or it could, based on instructions from the clients, allow certain clients to share files.
The trusted intermediary enforces modularity among multiple clients and ensures that a fault in one client has limited (or perhaps no) effect on another client. If the trusted intermediary provides sharing of resources among multiple clients, then it has to be carefully designed and implemented to ensure that the failures of one client don’t affect another client. For example, an incorrect update made by one client to its private files shouldn’t affect the private files of another client.
A file service is only one example of a trusted intermediary. Many services in client/service applications are trusted intermediaries. E-mail services store mailboxes for many users so that individual users don’t have to worry about losing their e-mail. As another example, instant message services provide private buddy lists. It is usually the clients that need some form of controlled sharing, and trusted intermediaries can provide that.
There are also situations in which intermediaries that do not have to be trusted are useful. For example, Section 4.2.3 describes how an untrusted intermediary can be used to buffer and deliver messages to multiple recipients. This use allows communication patterns other than request/response.
Another common use of trusted intermediaries is to simplify clients by having the trusted intermediary provide most functions. The buzzword in trade magazines for this use is “thin-client computing”. In this use, only the trusted intermediary must run on a powerful computer (or a collection of computers connected by a high-speed network) because the clients don’t run complex functions. Because in most applications there are only a few trusted intermediaries (compared to the number of clients), they can be managed by a professional staff and located in a secure machine room. Trusted intermediaries of this type may be expensive to design, build, and run because they may need many resources to support the clients. If one isn’t careful, the trusted intermediary can become a choke point during a flash crowd, when many clients ask for the service at the same time. At the cost of additional complexity, this problem can be avoided by carefully dividing the work between clients and the trusted intermediary and replicating services using the techniques described in Chapters 8 [on-line] through 10 [on-line].
Designs that have trusted intermediaries also have some general downsides. The trusted intermediary may be vulnerable to failures or attacks that knock out the service. Users must trust the intermediary, but what if it is compromised or is subjected to censorship? Fortunately, there are alternative architectures; see Sidebar 4.5.
Sidebar 4.5 Peer-to-peer Computing without Trusted Intermediaries
Peer-to-peer is a decentralized design that lacks trusted intermediaries. It is one of the oldest designs and has been used by, for example, the Internet e-mail system, the Internet news bulletin service, Internet service providers to route Internet packets, and IBM’s Systems Network Architecture. Recently, it has received much attention in the popular press because file-sharing applications have rediscovered some of its advantages.
In a peer-to-peer application, every computer participating in the application is a peer and is equal in function (but perhaps not in capacity) to any other computer. That is, no peer is more important than any other peer; if one peer fails, then this failure may degrade the performance of the application, but it won’t cause the application as a whole to fail. The client/service organization doesn’t have this property: if the service fails, the application fails, even if all client computers are operational.
UsenetNews is a good example of an older peer-to-peer application. UsenetNews, an on-line news bulletin, is one of the first peer-to-peer applications and has been operational since the 1980s. Users post to a newsgroup, from which other users read articles and respond. Nodes in UsenetNews propagate newsgroups to peers and serve articles to clients. An administrator of a node decides with which nodes the administrator’s node peers. Because most nodes interconnect with several other nodes, the system is fault tolerant, and the failure of one node leads at most to a performance degradation rather than to a complete failure. Because the nodes are spread across the world in different jurisdictions, it is difficult for any one central authority to censor content (but an administrator of a node can decide not to carry a group). Because of these properties, designers have proposed organizing other applications in a peer-to-peer style. For example, LOCKSS [Suggestions for Further Reading 10.2.3] has built a robust digital library in that style.
Recently, music-sharing applications and improvements in technology have brought peer-to-peer designs into the spotlight. Today, client computers are as powerful as yesterday’s computers for services and are connected with high data-rate links to the Internet. In music-sharing applications the clients are peers, and they serve and store music for one another. This organization aggregates the disk space and network links of all clients to provide a tremendous amount of storage and network capacity, allowing many songs to be stored and served. As often happens in the history of computer systems, the first version of this application was developed not by a computer scientist but by an 18-year-old, Shawn Fanning, who developed Napster. It (and its successors) has changed the characteristics of network traffic on the Internet and has raised legal questions as well.
In Napster, clients serve and store songs, but a trusted intermediary stores the location of a song. Because Napster was used for illegal music sharing, the Recording Industry Association of America (RIAA) sued the operators of the intermediary and was able to shut it down. In more recent peer-to-peer designs, developers adopted the design of censor-resistant applications and avoided the use of a trusted intermediary to locate songs. In these successors to Napster, the peers locate music by querying other peers; if any individual node is shut down, it will not render the service unavailable. The RIAA must now sue individual users.
Accurately and quickly finding information in a large network of peers without a trusted intermediary is a difficult problem. Without an intermediary there is no central, well-known computer to track the locations of songs. A distributed algorithm is necessary to find a song. A simple algorithm is to send a query for a song to all neighbor peers; if they don’t have a copy, the peers forward the query to their neighbors, and so on. This algorithm works, but it is inefficient because it sends a query to every node in the network. To avoid flooding the network of peers on each query, one can stop forwarding the query after it has been forwarded a number of times. Bounding a search in this way may cause some queries to return no answer, even though the song is somewhere in the network.
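The bounded flooding idea can be sketched in a few lines (a toy model of ours; PEERS and LIBRARY are hypothetical stand-ins for the network and for each peer’s stored songs):

    def flood_search(peer, song, hops_left, visited=None):
        visited = visited if visited is not None else set()
        if peer in visited:
            return None                       # don't revisit a peer
        visited.add(peer)
        if song in LIBRARY.get(peer, ()):
            return peer                       # found a copy at this peer
        if hops_left == 0:
            return None                       # bounded: stop forwarding
        for neighbor in PEERS.get(peer, ()):  # forward the query to the neighbors
            hit = flood_search(neighbor, song, hops_left - 1, visited)
            if hit:
                return hit
        return None

    PEERS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    LIBRARY = {"d": {"song.mp3"}}
    print(flood_search("a", "song.mp3", hops_left=2))  # 'd'
    print(flood_search("a", "song.mp3", hops_left=1))  # None: the bound hid the copy

The second call illustrates the stated trade-off: bounding the search keeps the query from flooding the whole network, but it can miss a song that is actually present.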
This problem has sparked interest in the research community, leading to the invention of better algorithms for decentralized search services and resulting in a range of new peer-to-peer applications. Some of these topics are covered in problem sets; see, for example, problem sets 20 [on-line] and 23 [on-line].
Figure 4.6 shows the file system example of Figure 2.18 organized in a client/service style, along with the messages between the clients and services. The editor is a client of the file service, which in turn is a client of the block-storage service. The figure shows the message interaction among these three modules using a message timing diagram.
Figure 4.6 File service using a message timing diagram.
In the depicted example, the client constructs an OPEN message, specifying the name of the file to be opened. The service checks permissions and, if the user is allowed access, sends a response back indicating success (OK) and the value of the file pointer (0) (see Section 2.3.2 for an explanation of a file pointer). The client writes text to the file, which results in a WRITE request that specifies the text and the number of bytes to be written. The file service writes the file by allocating blocks on the block service and copying the specified bytes into them; the block service returns a message stating the number of bytes written (16). After receiving this response from the block service, the file service constructs a response for the client, indicating success and informing the client of the new value of the file pointer. When the client is done editing, the client sends a CLOSE message, telling the service that the client is finished with this file.
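As an illustration of the request handling only (ours, not WFS or NFS; permissions, failures, concurrency, and the separate block service are all omitted or simulated), a toy service might dispatch these three messages as follows:

    class ToyFileService:
        # A toy trusted intermediary that understands the three requests in
        # Figure 4.6; the in-memory dictionary stands in for the block service.
        def __init__(self):
            self.files = {}        # file name -> bytes
            self.pointers = {}     # file name -> file pointer

        def handle(self, request):
            op, name, *args = request
            if op == "OPEN":
                self.files.setdefault(name, b"")
                self.pointers[name] = 0
                return ("OK", self.pointers[name])      # file pointer starts at 0
            if op == "WRITE":
                data = args[0]
                self.files[name] += data                # copy the bytes into "blocks"
                self.pointers[name] += len(data)
                return ("OK", self.pointers[name])      # new file pointer value
            if op == "CLOSE":
                del self.pointers[name]
                return ("OK",)
            return ("ERROR", "unknown request")

    service = ToyFileService()
    print(service.handle(("OPEN", "plan.txt")))                  # ('OK', 0)
    print(service.handle(("WRITE", "plan.txt", b"business plan")))  # ('OK', 13)
    print(service.handle(("CLOSE", "plan.txt")))                 # ('OK',)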
This message sequence is too simple for use in practice because it doesn’t deal with failures (e.g., what happens if the service fails while processing a write request), concurrency (e.g., what happens if multiple clients update a shared file), or security (e.g., how to ensure that a malicious person cannot write the business plan). A file service that is almost this simple is the Woodstock File System (WFS), designed by researchers at the Xerox Palo Alto Research Center [Suggestions for Further Reading 4.2.1]. Section 4.5 is a case study of a widely used successor, the Network File System (NFS), which is organized as a client/service application, and summarizes how NFS handles failures and concurrency. Handling concurrency, failures, and security in general are topics we explore in a systematic way in later chapters.
The file service is a trusted intermediary because it protects the content of files. It must check whether the messages came from a legitimate client (and not from an attacker), it decides whether the client has permission to perform the requested operation, and, if so, it performs the operation. Thus, as long as the file service does its job correctly, clients can share files (and thus also the block-storage service) in a protected manner.
This section describes two extensions to sending and receiving messages. First, it introduces remote procedure call (RPC), a stylized form of client/service interaction in which each request is followed by a response. The goal of RPC systems is to make a remote procedure call look like an ordinary procedure call. Because a service fails independently from a client, however, a remote procedure call can generally not offer identical semantics to procedure calls. As explained in the next subsection, some RPC systems provide various alternative semantics and the programmer must be aware of the details.
Second, in some applications it is desirable to be able to send messages to a recipient that is not on-line and to receive messages from a sender that is not on-line. For example, electronic mail allows users to send e-mail without requiring the recipient to be on-line. Using an intermediary for communication, we can implement these applications.
In many of the examples in the previous section, the client and service interact in a stylized fashion: the client sends a request, and the service replies with a response after processing the client’s request. This style is so common that it has received its own name: remote procedure call, or RPC for short.
RPCs come in many varieties, adding features to the basic request/response style of interaction. Some RPC systems, for example, simplify the programming of clients and services by hiding many of the details of constructing and formatting messages. In the time service example above, the programmer must call SEND_MESSAGE and RECEIVE_MESSAGE, convert results into numbers, and so on. Similarly, in the file service example, the client and service have to construct messages and convert numbers into bit strings and the like. Programming these conversions is tedious and error prone.
Stubs remove this burden from the programmer (see Figure 4.7). A stub is a procedure that hides the marshaling and communication details from the caller and callee. An RPC system can use stubs as follows. The client module invokes a remote procedure, say GET_TIME, in the same way that it would call any other procedure. However, GET_TIME is actually just the name of a stub procedure that runs inside the client module (see Figure 4.8). The stub marshals the arguments of a call into a message, sends the message, and waits for a response. On arrival of the response, the client stub unmarshals the response and returns to the caller.
Figure 4.7 Implementation of a remote procedure call using stubs. The stubs hide all remote communication from the caller and callee.
Figure 4.8 GET_TIME client and service using stubs.
Similarly, a service stub waits for a message, unmarshals the arguments, and calls the procedure that the client requests (GET_TIME in the example). After the procedure returns, the service stub marshals the results of the procedure call into a message and sends it in a response to the client stub.
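To make the division of labor concrete, here is a minimal sketch of the two stubs in Python. The transport callables (send_message, receive_response, receive_request), the wire format, and the opcode constant are assumptions for illustration, not part of any particular RPC system; a real system would generate code like this from an interface specification.

    import struct

    GET_TIME_OPCODE = 1   # hypothetical operation code identifying the request

    def get_time_client_stub(send_message, receive_response, service_address):
        # Client stub: marshal the call into a message, send it, wait for
        # the response, unmarshal the result, and return it to the caller.
        request = struct.pack("!I", GET_TIME_OPCODE)   # marshal: 4-byte opcode
        send_message(service_address, request)
        response = receive_response()                  # wait for the response
        (seconds,) = struct.unpack("!Q", response)     # unmarshal: 8-byte integer
        return seconds

    def get_time_service_stub(receive_request, send_message, get_time):
        # Service stub: wait for a request, unmarshal it, call the procedure
        # the client asked for, marshal the result, and send the response.
        sender, request = receive_request()
        (opcode,) = struct.unpack("!I", request)
        if opcode == GET_TIME_OPCODE:
            seconds = get_time()                       # the ordinary procedure
            send_message(sender, struct.pack("!Q", seconds))

The caller of get_time_client_stub never touches a message; marshaling and communication are hidden on both sides, which is exactly the point of the stub.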
Writing stubs that convert more complex objects into an appropriate on-wire representation becomes quite tedious. Some high-level programming languages such as Java can generate these stubs automatically from an interface specification [Suggestions for Further Reading 4.1.3], simplifying client/service programming even further. Figure 4.9 shows the client for such an RPC system. The RPC system would generate a procedure similar to the GET_TIME stub in Figure 4.8. The client program of Figure 4.9 looks almost identical to the one using a local procedure call on page 149, except that it handles an additional error because remote procedure calls are not identical to procedure calls (as discussed below). The procedure that the service calls on line 7 is just the original procedure GET_TIME on page 149.
Figure 4.9 GET_TIME client using a system that generates RPC stubs automatically.
Whether a system uses RPC with automatic stub generation is up to the implementers. For example, some implementations of Sun’s Network File System (see Section 4.5) use automatic stub generation, but others do not.
It is tempting to think that by using stubs one can make a remote procedure call behave exactly the same as an ordinary procedure call, so that a programmer doesn’t have to think about whether the procedure runs locally or remotely. In fact, this goal was a primary one when RPC was originally proposed, hence the name remote “procedure call”. However, RPCs differ from ordinary procedure calls in three important ways. First, RPCs can reduce fate sharing between caller and callee by exposing the failures of the callee to the caller so that the caller can recover. Second, RPCs introduce new failures that don’t appear in procedure calls. These two differences change the semantics of remote procedure calls as compared with ordinary procedure calls, and the changes usually require the programmer to make adjustments to the surrounding code. Third, remote procedure calls take more time than ordinary procedure calls: the few instructions needed to invoke a procedure (see Figure 4.2) cost far less than invoking a stub, marshaling arguments, sending a request over a network, invoking a service stub, unmarshaling arguments, marshaling the response, receiving the response over the network, and unmarshaling the response.
To illustrate the first difference, consider writing a procedure call to the library program SQRT, which computes the square root of its argument x. A careful programmer would plan for the case that SQRT (x) will fail when x is negative by providing an explicit exception handler for that case. However, the programmer using ordinary procedure calls almost certainly doesn’t go to the trouble of planning for certain possible failures because they have negligible probability. For example, the programmer probably would not think of setting an interval timer when invoking SQRT (x), even though SQRT internally has a successive-approximation loop that, if programmed wrong, might not terminate.
But now consider calling SQRT with an RPC. An interval timer suddenly becomes essential because the network between client and service can lose a message, or the other computer can crash independently. To avoid fate sharing, the RPC programmer must adjust the code to prepare for and handle this failure. When the client receives a “service failure” signal, the client may be able to recover by, for example, trying a different service or choosing an alternative algorithm that doesn’t use a remote service.
The second difference between ordinary procedure calls and RPCs is that RPCs introduce a new failure mode, the “no response” failure. When there is no response from a service, the client cannot tell which of two things went wrong: (1) some failure occurred before the service had a chance to perform the requested action, or (2) the service performed the action and then a failure occurred, causing just the response to be lost.
Most RPC designs handle the no-response case by choosing one of three implementation strategies:
At-least-once RPC. If the client stub doesn’t receive a response within some specific time, the stub resends the request as many times as necessary until it receives a response from the service. This implementation may cause the service to execute a request more than once. For applications that call SQRT, executing the request more than once is harmless because with the same argument SQRT should always produce the same answer. In programming language terms, the SQRT service has no side effects. Such side-effect-free operations are also idempotent: repeating the same request or sequence of requests several times has the same effect as doing it just once. An at-least-once implementation does not, however, provide the guarantee implied by its name. For example, if the service is located in a building that has been blown away by a hurricane, retrying doesn’t help. To handle such cases, an at-least-once RPC implementation will give up after some number of retries. When that happens, the request may have been executed more than once or not at all. (A sketch of this strategy and the next appears after this list.)
At-most-once RPC. If the client stub doesn’t receive a response within some specific time, then the client stub returns an error to the caller, indicating that the service may or may not have processed the request. At-most-once semantics may be more appropriate for requests that do have side effects. For example, in a banking application, using at-least-once semantics for a request to transfer $100 from one account to another could result in multiple $100 transfers. Using at-most-once semantics ensures that either zero or one transfer takes place, a somewhat more controlled outcome. Implementing at-most-once RPC is harder than it sounds because the underlying network may duplicate the request message without the client stub’s knowledge. Chapter 7 [on-line] describes an at-most-once implementation, and Birrell and Nelson’s paper gives a nice, complete description of an RPC system that implements at-most-once [Suggestions for Further Reading 4.1.1].
Exactly-once RPC. These semantics are the ideal, but because the client and the service fail independently, they are in principle impossible to guarantee. As in the case of at-least-once, if the service is in a building that was blown away by a hurricane, the best the client stub can do is return an error status. On the other hand, by adding the complexity of extra message exchanges and careful record keeping, one can approach exactly-once semantics closely enough to satisfy some applications. The general idea is that, if the RPC requesting transfer of $100 from account A to B produces a “no response” failure, the client stub sends a separate RPC request to the service to ask about the status of the request that got no response. This solution requires that both the client and the service stubs keep careful records of each remote procedure call request and response. These records must be fault tolerant because the computer running the service might fail and lose its state between the original RPC and the inquiry to check on the RPC’s status. Chapters 8 [on-line] through 10 [on-line] introduce the necessary techniques.
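The difference between the first two strategies shows up entirely in how the client stub reacts to a timeout. The following Python sketch contrasts them; the transport callables, the timeout and retry values, and the ServiceFailure exception are assumptions for illustration, and only the client side is shown.

    class ServiceFailure(Exception):
        # Raised when the RPC implementation gives up; the caller cannot
        # tell how many times the request was actually executed.
        pass

    def at_least_once_rpc(send, receive_with_timeout, request,
                          timeout=1.0, retries=5):
        # Resend until a response arrives. The service may execute the
        # request more than once, so the request had better be idempotent.
        for _ in range(retries):
            send(request)
            response = receive_with_timeout(timeout)   # None on timeout
            if response is not None:
                return response
        raise ServiceFailure("no response; executed zero or more times")

    def at_most_once_rpc(send, receive_with_timeout, request, timeout=1.0):
        # Send exactly one copy; on timeout, report an error rather than
        # resend, so the service executes the request zero times or once.
        send(request)
        response = receive_with_timeout(timeout)
        if response is None:
            raise ServiceFailure("request may or may not have been executed")
        return response

As the text notes, the at-most-once sketch is incomplete on its own: because the network itself may duplicate the request, a real implementation also needs duplicate suppression at the service.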
The programmer must be aware that RPC semantics differ from those of ordinary procedure calls, and because different RPC systems handle the no-response case in different ways, it is important to understand just which semantics any particular RPC system tries to provide. Even if the name of the implementation implies a guarantee (e.g., at-least-once), we have seen that there are cases in which the implementation cannot deliver it. One cannot simply take a collection of legacy programs and arbitrarily separate the modules with RPC. Some thought and reprogramming are inevitably required. Problem set 2 explores the effects of different RPC semantics in the context of a simple client/service application.
The third difference is that calling a local procedure typically takes much less time than making a remote procedure call. For example, invoking a remote SQRT is likely to be more expensive than the computation of SQRT itself because the overhead of a remote procedure call is much higher than the overhead of following the procedure calling conventions. To hide the cost of a remote procedure call, a client stub may deploy various performance-enhancing techniques (see Chapter 6), such as caching results and pipelining requests (as is done in the X Window System of Sidebar 4.4). These techniques increase complexity and can introduce new problems (e.g., how to ensure that the cache at the client stays consistent with the one at the service). The performance difference between procedure calls and remote procedure calls requires the designer to consider carefully which procedure calls should be remote and which should be ordinary, local procedure calls.
A final difference between procedure calls and RPCs is that some programming language features don’t combine well with RPC. For example, a procedure that communicates with another procedure through global variables cannot typically be executed remotely because separate computers usually have separate address spaces. Similarly, other language constructs that use explicit addresses won’t work. Arguments consisting of data structures that contain pointers, for example, are a problem because pointers to objects in the client computer are local addresses that have different bindings when resolved in the service computer. It is possible to design systems that use global references for objects passed by reference to remote procedure calls, but doing so requires significant additional machinery and introduces new problems. For example, a new plan is needed for determining whether an object can be deleted locally because a remote computer might still have a reference to the object. Solutions exist, however; see, for example, the article on Network Objects [Suggestions for Further Reading 4.1.2].
Since RPCs don’t provide the same semantics as procedure calls, the word “procedure” in “remote procedure call” can be misleading. Over the years the concept of RPC has evolved from its original interpretation as an exact simulation of an ordinary procedure call to instead mean any client/service interaction in which the request is followed by a response. This text uses this modern interpretation.
Sending a message from a sender to a receiver requires that both parties be available at the same time. In many applications this requirement is too strict. For example, in electronic mail we desire that a user be able to send an e-mail to a recipient even if the recipient is not on-line at the time. The sender sends the message and the recipient receives the message some time later, perhaps when the sender is not on-line. We can implement such applications using an intermediary. In the case of communication, this intermediary doesn’t have to be trusted because communication applications often consider the intermediary to be part of an untrusted network and have a separate plan for securing messages (as we will see in Chapter 11 [on-line]).
The primary purpose of the e-mail intermediary is to implement buffered communication. Buffered communication provides the SEND/RECEIVE abstraction but avoids the requirement that the sender and receiver be present simultaneously. It allows the delivery of a message to be shifted in time. The intermediary can hold messages until the recipient comes on-line. The intermediary might buffer messages in volatile memory or in non-volatile memory, such as a file system. The latter design allows the intermediary to buffer messages across power failures.
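A minimal sketch of such an intermediary appears below, in Python. It keeps one in-memory queue per recipient; the class and method names are illustrative, and a version that buffers messages across power failures would write each message to non-volatile storage, such as a file, before acknowledging it.

    import collections

    class MailboxIntermediary:
        # Buffers messages for recipients who are not currently on-line.
        def __init__(self):
            self.mailboxes = collections.defaultdict(collections.deque)

        def send(self, recipient, message):
            # The sender deposits the message and may go off-line at once.
            self.mailboxes[recipient].append(message)

        def receive(self, recipient):
            # The recipient pulls buffered messages whenever it comes
            # on-line; returns None if the mailbox is empty.
            box = self.mailboxes[recipient]
            return box.popleft() if box else None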
Once we have an intermediary, three interesting design opportunities arise. First, the sender and receiver may make different choices of whether to push or pull messages. Push is when the initiator of a data movement sends the data. Pull is when the initiator of a data movement asks the other end to send it the data. These definitions are independent of whether or not the system uses an intermediary, but in systems with intermediaries it is not uncommon to find both. For example, the sender in the Internet’s e-mail system, Simple Mail Transfer Protocol (SMTP), pushes the mail to the service that holds the recipient’s mailbox. On the other hand, the receiving client pulls messages to fetch mail from a mailbox: the user hits the “Get new mail” button, which causes the mail client to contact the mailbox service and ask it for any new mail.
Second, the existence of an intermediary opens an opportunity to apply the design principle decouple modules with indirection by having the intermediary, rather than the originator, determine to whom a message is delivered. For example, an Internet user can send a message to president@whitehouse.gov. The intermediary that forwards the message will deliver it to whoever happens to be the President. As another example, users should be able to send an e-mail to a mailing list or to post a message to a bulletin board without knowing exactly who is on the mailing list or subscribed to the bulletin board.
Third, when indirection through an intermediary is available, the designer has a choice of when and where to duplicate messages. In the mailing list example, the intermediary sends a copy of the e-mail to each member of the list. In the bulletin board example, an intermediary may group messages and send them as a group to other intermediaries. When a user fetches the bulletin article from its local intermediary, the local intermediary makes a final copy for delivery to the user.
Publish/subscribe is a general style of communication that takes advantage of the three design opportunities of communication through an intermediary. In this communication model, the sender is called the publisher and notifies an event service that it has produced a new message on a certain topic. Recipients subscribe to the event service and express their interest in certain topics. If multiple recipients are interested in the same topic, all of them receive a copy of the message. Popular usages of publish/subscribe are electronic mailing lists and instant messaging services that provide chat rooms. A user might join a chat room on a certain topic. When another user publishes a message in the room, all the members of that room receive it. Another publish/subscribe application is Usenet News, a bulletin board service (described in Sidebar 4.5 on peer-to-peer computing).
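A minimal sketch of the publish/subscribe model, assuming an in-memory event service; the names EventService, subscribe, and publish are illustrative rather than any particular product’s interface.

    import collections

    class EventService:
        def __init__(self):
            self.subscribers = collections.defaultdict(list)  # topic -> callbacks

        def subscribe(self, topic, deliver):
            # A recipient expresses interest in a topic by registering a
            # delivery procedure (for example, one that fills its mailbox).
            self.subscribers[topic].append(deliver)

        def publish(self, topic, message):
            # The publisher notifies the event service, which makes one
            # copy of the message for every subscriber to that topic.
            for deliver in self.subscribers[topic]:
                deliver(message)

    # Example: a two-member chat room on the topic "volcanoes".
    service = EventService()
    service.subscribe("volcanoes", lambda m: print("alice got:", m))
    service.subscribe("volcanoes", lambda m: print("bob got:", m))
    service.publish("volcanoes", "new eruption data posted")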
The client/service model enforces modularity and is the basic approach to organizing complex computer systems. The rest of the book works out major issues in building computer systems that this chapter has identified but has not addressed:
Enforcing modularity within a computer (Chapter 5). Restricting the implementation of client/service systems to one computer per module can be too expensive. Chapter 5 shows how an operating system can use a technique called virtualization to create many virtual computers out of one physical computer. The operating system can enforce modularity between each client and each service by giving each client and each service a separate virtual computer.
Performance (Chapter 6). Computer systems have implicit or explicit performance goals. If services are not carefully designed, it is possible that the slowest service in the system becomes a performance bottleneck, which causes the complete system to operate at the performance of the slowest service. Identifying performance bottlenecks and avoiding them is a challenge that a designer faces in most computer systems.
Networking (Chapter 7 [on-line]). The client/service model must have a way to send the request message from the client to the service, and the response message back. Implementing SEND_MESSAGE and RECEIVE_MESSAGE is a challenging problem, since networks may lose, reorder, or duplicate messages while routing them between the client and the service. Furthermore, networks exhibit a wide range of performance properties, making straightforward solutions inadequate.
Fault tolerance (Chapter 8 [on-line]). We may need a service to continue to operate even if some of the hardware and software modules fail. For example, we may want to construct a fault tolerant date-and-time service that runs on several computers so that if one of the computers fails, another computer can still deliver a response to requests for the date and time. In systems that harness a large number of computers to deliver a single service, it is unavoidable that at any instant of time some of the computers will have failed. For example, Google, which indexes the Web, reportedly uses more than 100,000 computers to deliver the service. (A description of the systems Google has designed can be found in Suggestions for Further Reading 3.2.4 and 10.1.10.) With so many computers, some of them are certain to be unavailable. Techniques for fault tolerance allow designers to implement reliable services out of unreliable components. These techniques involve detecting failures, containing them, and recovering from them.
Atomicity (Chapter 9 [on-line]). The file service described in this chapter (Figure 4.6 in Section 4.1.6) must work correctly in the face of concurrent access and failures, and use OPEN and CLOSE calls to mark related READ and WRITE operations. Chapter 9 [on-line] introduces a single framework called atomicity that addresses both issues. This framework allows the operations between an OPEN and CLOSE call to be executed as an atomic, indivisible action. As we saw in Section 4.2.2, exactly-once RPC is ideal for implementing a banking application. Chapter 9 [on-line] introduces the necessary tools for exactly-once RPC and building such applications.
Consistency (Chapter 10 [on-line]). This chapter uses messages to implement various protocols to ensure consistency of data stores on different computers.
Security (Chapter 11 [on-line]). The client/service model protects against accidental errors propagating from one module to another module. Some services may need to protect against malicious attacks. This requirement arises, for example, when a file service is storing sensitive data and needs to ensure that malicious users cannot read the sensitive data. Such protection requires that the service reliably identify users so that it can make an authorization decision. The design of systems in the face of malicious users is a topic known as security.
The subsystems that address these topics are interesting systems in their own right and are case studies of managing complexity. Typically, these subsystems are internally structured as client/service systems, applying the concept of this chapter recursively. The next two sections provide two case studies of real-world client/service systems and also illustrate the need for the topics addressed in the subsequent chapters.
The Internet Domain Name System (DNS) provides an excellent case study of both a client/service application and a successful implementation of a naming scheme, in this case for naming of Internet computers and services. Although designed for that specific application, DNS is actually a general-purpose name management and name resolution system that hierarchically distributes the management of names among different naming authorities and also hierarchically distributes the job of resolving names to different name servers. Its design allows it to respond rapidly to requests for name resolution and to scale up to extremely large numbers of stored records and numbers of requests. It is also quite resilient, in the sense that it provides continued, accurate responses in the face of many kinds of network and server failures.
The primary use for DNS is to associate user-friendly character-string names, called domain names, with machine-oriented binary identifiers for network attachment points, called Internet addresses. Domain names are hierarchically structured, the term domain being used in a general way in DNS: it is simply a set of one or more names that have the same hierarchical ancestor. This convention means that hierarchical regions can be domains, but it also means that the personal computer on your desk is a domain with just one member. In consequence, although the phrase “domain name” suggests the name of a hierarchical region, every name resolved by DNS is called a domain name, whether it is the name of a hierarchical region or the name of a single attachment point. Because domains typically correspond to administrative organizations, they also are the unit of delegation of name assignment, using exactly the hierarchical naming scheme described in Section 3.1.4.
For our purposes, the basic interface to DNS is quite simple:
value ← DNS_RESOLVE (domain_name)
This interface omits the context argument from the standard name-resolving interface of the naming model of Section 2.2.1 because there is just a single, universal, default context for resolving all Internet domain names, and the reference to that one context is built into DNS_RESOLVE as a configuration parameter.
In the usual DNS implementation, binding is not accomplished by invoking BIND and UNBIND procedures as suggested by our naming model, but rather by using a text editor or database generator to create and manage tables of bindings. These tables are then loaded into DNS servers by some behind-the-scenes method as often as their managers deem necessary. One consequence of this design is that changes to DNS bindings don’t often occur within seconds of the time you request them; instead, they typically take hours.
Domain names are path names, with components separated by periods (called dots, particularly when reading domain names aloud) and with the least significant component coming first. Three typical domain names are
ginger.cse.pedantic.edu
ginger.scholarly.edu
ginger.com
DNS allows both relative and absolute path names. Absolute path names are supposed to be distinguished by the presence of a trailing dot. In human interfaces the trailing dot rarely appears; instead, DNS_RESOLVE applies a simple form of multiple lookup. When presented with a relative path name, DNS_RESOLVE first tries appending a default context, supplied by a locally set configuration parameter. If the resulting extended name fails to resolve, DNS_RESOLVE tries again, this time appending just a trailing dot to the originally presented name. Thus, for example, if one presents DNS_RESOLVE with the apparently relative path name “ginger.com”, and the default context is “pedantic.edu.”, DNS_RESOLVE will first try to resolve the absolute path name “ginger.com.pedantic.edu.”. If that attempt leads to a NOT-FOUND result, it will then try to resolve the absolute path name “ginger.com.”
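The multiple-lookup rule fits in a few lines of Python. The try_absolute resolver and the default context value are assumptions for illustration; only the order of the two attempts comes from the description above.

    def dns_resolve(name, try_absolute, default_context="pedantic.edu."):
        # try_absolute(absolute_name) returns a record, or None if the
        # absolute name (one ending in a trailing dot) fails to resolve.
        if name.endswith("."):
            return try_absolute(name)          # already an absolute path name
        # First attempt: extend the relative name with the default context.
        record = try_absolute(name + "." + default_context)
        if record is not None:
            return record
        # Second attempt: treat the presented name itself as absolute.
        return try_absolute(name + ".")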
DNS name resolution might have been designed in at least three ways:
1. The telephone book model: Give each network user a copy of a file that contains a list of every domain name and its associated Internet address. This scheme has a severe problem: to cover the entire Internet, the size of the file would be proportional to the number of network users, and updating it would require delivering a new copy to every user. Because the frequency of update tends to be proportional to the number of domain names listed in the file, the volume of network traffic required to keep it up to date would grow with the cube of the number of domain names. This scheme was used for nearly 20 years in the Internet, was found wanting, and was replaced with DNS in the late 1980s.
2. The central directory service model: Place the file on a single well-connected server somewhere in the network and provide a protocol to ask it to resolve names. This scheme would make update easy, but with growth in the number of users its designer would have to adopt increasingly complex strategies to keep it from becoming both a performance bottleneck and a potential source of massive failure. There is yet another problem: whoever controls the central server is by default in charge of all name assignment. This design does not cater well to delegation of responsibility in assignment of domain names.
3. The distributed directory service model. The idea is to have many servers, each of which is responsible for resolving some subset of domain names, and a protocol for finding a server that can resolve any particular name. As we shall see in the following descriptions, this model can provide delegation and respond to increases in scale while maintaining reliability and performance. For those reasons, DNS uses this model.
With the distributed directory service model, the operation of every name server is the same: a server maintains a set of name records, each of which binds a domain name to an Internet address. When a client sends a request for a name resolution, the name server looks through the collection of domain names for which it is responsible, and if it finds a name record, it returns that record as its response. If it does not find the requested name, it looks through a separate set of referral records. Each referral record binds a hierarchical region of the DNS name space to some other name server that can help resolve names in that region of the naming hierarchy. Starting with the most significant component of the requested domain name, the server searches through referral records for the one that matches the most components, and it returns that referral record. If nothing matches, DNS cannot resolve the original name, so it returns a “no such domain” response.
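In Python, a name server’s lookup step might be sketched as follows. The dictionary representations of the two record sets are assumptions; the longest-match search over referral records follows the description above, and the record types AP and NS match Figure 4.10.

    def handle_request(domain_name, name_records, referral_records):
        # name_records:     {domain_name: internet_address}, authoritative.
        # referral_records: {region_suffix: name_server_address}.
        if domain_name in name_records:
            return ("AP", name_records[domain_name])    # an address record
        # Find the referral record matching the most components, starting
        # from the most significant (rightmost) component of the name.
        best, best_len = None, 0
        for region, server in referral_records.items():
            if domain_name == region or domain_name.endswith("." + region):
                n = len(region.rstrip(".").split("."))  # components matched
                if n > best_len:
                    best, best_len = ("NS", server), n
        if best is not None:
            return best                                  # refer the client on
        return ("ERROR", "no such domain")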
The referral architecture of DNS, though conceptually simple, has a number of elaborations that enhance its performance, scalability, and robustness. We begin with an example of its operation in a simple case, and we later add some of the enhancements. The dashed lines in Figure 4.10 illustrate the operation of DNS when the client computer named ginger.cse.pedantic.edu, in the lower left corner, tries to resolve the domain name ginger.Scholarly.edu. The first step, shown as request #1, is that DNS_RESOLVE sends that domain name to a root name server, whose Internet address it somehow knows. Section 4.4.4 explains how DNS_RESOLVE discovers that address.
Figure 4.10 Structure and operation of the Internet Domain Name System. In this figure, each circle represents a name server, and each rectangle is a name client. The type NS in a table or in a response means that this is a referral to another name server, while the type AP in a table or a response means that this is an Internet address. The dashed lines show the paths of the three requests made by the name client in the lower left corner to resolve the name ginger.Scholarly.edu, starting with the root name server. The dotted lines show resolution of a request of the name client in the lower right corner to resolve anise.pedantic.edu starting with a local name server.
The root name server matches the name in the request with the subset of domain names it knows about, starting with the most significant component of the requested domain name (in this example, edu). In this example, the root name server discovers that it has a referral record for the domain edu, so it responds with a referral, saying, in this example, “There is a name server for a domain named edu. The name record for that name server binds the name names.edu. to Internet address 192.14.71.191.” This response illustrates that name servers, like any other servers, have both domain names and Internet addresses. Usually, the domain name of a name server gives some clue about what domain names it serves, but there is no necessary correspondence. Responding with a complete name record provides more information than the client really needs (the client usually doesn’t care about the name of the name server), but it allows all responses from a name server to be uniform. Because the name server’s domain name isn’t significant and to reduce clutter in Figure 4.10, that figure omits it in the illustrated response.
When the client’s DNS_RESOLVE receives this response, it immediately resends the same name resolution request, but this time it directs the request (request 2 in the figure) to the name server located at the Internet address mentioned in response number 1. That name server matches the requested path name with the set of domain names it knows about, again starting with the most significant component. In this case, it finds a match for the name Scholarly.edu. in a referral record. It thus sends back a response saying, “There is a name server for a domain named Scholarly.edu. The name record for that name server binds the name ns.iss.edu. to Internet address 128.32.136.9.” The illustration again omits the domain name of the name server.
This sequence repeats for each component of the original path name, until DNS_RESOLVE finally reaches a name server that has the name record for ginger.Scholarly.edu. That name server sends back a response saying, “The name record for ginger.Scholarly.edu. binds that name to Internet address 169.229.2.16.” This being the answer to the original query, DNS_RESOLVE returns this result to its caller, which can go on to initiate an exchange of messages with its intended target.
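The client side of this exchange is an iterative loop: each referral tells DNS_RESOLVE which name server to ask next. A sketch under the same assumptions as the server sketch above:

    def resolve_iteratively(domain_name, query, root_address, max_steps=10):
        # query(server_address, domain_name) returns ("AP", address),
        # ("NS", next_server_address), or ("ERROR", reason).
        server = root_address
        for _ in range(max_steps):             # bound the chain of referrals
            kind, value = query(server, domain_name)
            if kind == "AP":
                return value                   # the Internet address we wanted
            if kind == "NS":
                server = value                 # re-send the request to the
                continue                       # name server we were referred to
            raise LookupError(value)           # "no such domain"
        raise LookupError("too many referrals")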
The server that holds either a name record or a referral record for a domain name is known as the authoritative name server for that domain name. In our example, the name server ns3.cse.pedantic.edu. is authoritative for the ginger.cse.pedantic.edu. domain, as well as all other domain names that end with cse.pedantic.edu., and ns.iss.edu. is authoritative for the Scholarly.edu. domain. Since a name server does not hold the name record for its own name, a name server cannot be the authoritative name server for its own name. Instead, for example, the root name server is authoritative for the domain name edu., while the names.edu. name server is authoritative for all domain names that end in edu.
That is the basic model of DNS operation. Here are some elaborations in its operation, each of which helps make the system fast-responding, robust, and capable of growing to a large scale.
1. It is not actually necessary to send the initial request to the root name server. DNS_RESOLVE can send the request to any convenient name server whose Internet address it knows. The name server doesn’t care where the request came from; it simply compares the requested domain name with the list of domain names for which it is responsible in order to see if it holds a record that can help. If it does, it answers the request. If it doesn’t, it answers by returning a referral to a root name server. The ability to send any request to a local name server means that the common case in which the client, the name server, and the target domain name are all three in the same domain (e.g., pedantic.edu) can be handled swiftly with a single request/response interaction. (The dotted lines in the lower right corner of Figure 4.10 show an example, in which thyme.pedantic.edu. asks the name server for the pedantic.edu domain for the address of anise.pedantic.edu.) This feature also simplifies name discovery because all a client needs to know is the Internet address of any nearby name server. The first request to that nearby server for a distant name (in the current example, ginger.scholarly.edu) will return a referral to the Internet address of a root name server.
2. Some domain name servers offer what is (perhaps misleadingly) called recursive name service. If the name server does not hold a record for the requested name, rather than sending a referral response, the name server takes on the responsibility for resolving the name itself. It forwards the initial request to a root name server, then continues to follow the chain of responses to resolve the complete path name, and finally returns the desired name record to its client. By itself, this feature seems merely to simplify life for the client, but in conjunction with the next feature it provides a major performance enhancement.
3. Every name server is expected to maintain, in addition to its authoritative records, a cache of all name records it has heard about from other name servers. A server that provides recursive name service thus collects records that can greatly speed up future name resolution requests. If, for example, the name server for cse.pedantic.edu offers recursive service and it is asked to resolve the name flower.cs.scholarly.edu, in the course of doing so (assuming that it does not in turn request recursive service), its cache might acquire the following records:
edu refer to names.edu at 198.41.0.4
Scholarly.edu refer to ns.iss.edu at 128.32.25.19
cs.Scholarly.edu refer to cs.Scholarly.edu at 128.32.247.24
flower.cs.Scholarly.edu Internet address is 128.32.247.29
Now, when this name server receives, for example, the request to resolve the name psych.Scholarly.edu, it will discover the record for the domain Scholarly.edu in the cache and it will be able to quickly resolve the name by forwarding the initial request directly to the corresponding name server.
A cache holds a duplicate copy, which may go out of date if someone changes the authoritative name record. On the basis that changes of existing name bindings are relatively infrequent in the Domain Name System and that it is hard to keep track of all the caches to which a domain name record may have propagated, the DNS design does not call for explicit invalidation of changed entries. Instead, it uses expiration. That is, the naming authority for a DNS record marks each record that it sends out with an expiration period, which may range from seconds to months. A DNS cache manager is expected to discard entries that have passed their expiration period. The DNS cache manager provides a memory model that is called eventual consistency, a topic taken up in Chapter 10 [on-line].
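The expiration rule a DNS cache manager applies can be sketched in Python as follows; time.time and the record layout are illustrative.

    import time

    class DnsCache:
        def __init__(self):
            self.entries = {}    # domain_name -> (record, expiration time)

        def insert(self, domain_name, record, expiration_period):
            # Each record arrives stamped with an expiration period, which
            # may range from seconds to months; store the absolute deadline.
            self.entries[domain_name] = (record, time.time() + expiration_period)

        def lookup(self, domain_name):
            # Discard entries that have passed their expiration period,
            # rather than waiting for an explicit invalidation that the
            # DNS design never sends.
            hit = self.entries.get(domain_name)
            if hit is None:
                return None
            record, expires_at = hit
            if time.time() > expires_at:
                del self.entries[domain_name]
                return None
            return record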
Domain names form a hierarchy, and the arrangement of name servers described above matches that hierarchy, thereby distributing the job of name resolution. The same hierarchy also distributes the job of managing the handing out of names, by distributing the responsibility of operating name servers. Distributing responsibility is one of the main virtues of the distributed directory service model.
The way this works is actually quite simple: whoever operates a name server can be a naming authority, which means that he or she may add authoritative records to that name server. Thus, at some point early in the evolution of the Internet, some Pedantic University network administrator deployed a name server for the domain pedantic.edu and convinced the administrator of the edu domain to install a binding for the domain name pedantic.edu. associated with the name and Internet address of the pedantic.edu name server. Now, if Pedantic University wants to add a record, for example, for an Internet address that it wishes to name archimedes.pedantic.edu, its administrator can do so without asking permission of anyone else. A request to resolve the name archimedes.pedantic.edu can arrive at any domain name server in the Internet; that request will eventually arrive at the name server for the pedantic.edu domain, where it can be answered correctly. Similarly, a network administrator at the Institute for Scholarly Studies can install a name record for an Internet address named archimedes.Scholarly.edu on its own authority. Although both institutions have chosen the name archimedes for one of their computers, because the path names of the domains are distinct there was no need for their administrators to coordinate their name assignments. Put another way, their naming authorities can act independently.
Continuing this method of decentralization, any organization that manages a name server can create lower-level naming domains. For example, the Computer Science and Engineering Department of Pedantic University may have so many computers that it is convenient for the department to manage the names of those computers itself. All that is necessary is for the department to deploy a name server for a lower-level domain (named, for example, cse.pedantic.edu) and convince the administrator of the pedantic.edu domain to install a referral record for that name in its name server.
To ensure high availability of name service, the DNS specification calls on every organization that runs a name service to arrange that there be at least two identical replica servers. This specification is important, especially at higher levels of the domain naming hierarchy, because most Internet activity uses domain names and inability to resolve a name component blocks reachability to all sites below that name component. Many organizations have three or four replicas of their name servers, and as of 2008 there were about 80 replicas of the root name server. Ideally, replicas should be attached to the network at places that are widely separated, so that there is some protection against local network and electric power outages. Again, the importance of separated attachment increases at higher levels of the naming hierarchy. Thus, the 80 replicas of the root name server are scattered around the world, but the three or four replicas of a typical organization’s name server are more likely to be located within the campus of that organization. This arrangement ensures that, even if the campus is disconnected from the outside world, communication by name within the organization can still work. On the other hand, during such a disconnection, correspondents outside the organization cannot even verify that a name exists, for example, to validate an e-mail address. Therefore, a better arrangement might be to attach at least one of the organization’s multiple replica name servers to another part of the Internet.
For the same reason that name servers need to be replicated, many network services also need to be replicated, so DNS allows the same name to be bound to several Internet addresses. In consequence, the value returned by DNS_RESOLVE can be a list of (presumably) equivalent Internet addresses. The client can choose which Internet address to contact, based on order in the list, previous response times, a guess as to the distance to the attachment point, or any other criterion it might have available.
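The choice among the returned addresses is entirely up to the client. A minimal sketch, assuming the client remembers the response times of replicas it has tried before:

    def choose_address(addresses, past_response_times):
        # Prefer the replica with the best measured response time; fall
        # back to the first address in the list for untried replicas.
        tried = [a for a in addresses if a in past_response_times]
        if tried:
            return min(tried, key=lambda a: past_response_times[a])
        return addresses[0]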
The design of DNS allows name service to be quite robust. In principle, the job of a DNS server is extremely simple: accept a request packet, search a table, and send a response packet. Its interface specification does not require it to maintain any connection state, or any other durable, changeable state; its only public interface is idempotent. The consequence is that a small, inexpensive personal computer can provide name service for a large organization, which encourages dedicating a computer to this service. A dedicated computer, in turn, tends to be more robust than one that supplies several diverse and unrelated network services. In addition a server with small, read-only tables can be designed so that when something such as a power failure happens, it can return to service quickly, perhaps even automatically. (Chapters 8 [on-line] and 9 [on-line] discuss how to design such a system.)
DNS also allows synonyms, in the form of indirect names. Synonyms are used conventionally to solve two distinct problems. For an example of the first problem, suppose that the Pedantic University Computer Science and Engineering Department has a computer whose Internet address is named minehaha.cse.pedantic.edu. This is a somewhat older and slower machine, but it is known to be very reliable. The department runs a World Wide Web server on this computer, but as its load increases the department knows that it will someday be necessary to move the Web server to a faster machine named mississippi.cse.pedantic.edu. Without synonyms, when the server moves, it would be necessary to inform everyone that there is a new name for the department’s World Wide Web service. With synonyms, the laboratory can bind the indirect name www.cse.pedantic.edu to minehaha.cse.pedantic.edu and publicize the indirect name as the name of its Web site. When the time comes for mississippi.cse.pedantic.edu to take over the service, it can do so by simply having the manager of the cse.pedantic.edu domain change the binding of the indirect name. All those customers who have been using the name www.cse.pedantic.edu to get to the Web site will find that name continues to work correctly; they don’t care that a different computer is now handling the job. As a general rule, the names of services can be expected to outlive their bindings to particular Internet addresses, and synonyms cater to this difference in lifetimes.
The second problem that synonyms can handle is to allow a single computer to appear to be in two widely different naming domains. For example, suppose that a geophysics group at the Institute of Scholarly Studies has developed a service to predict volcano eruptions but that organization doesn’t actually have a computer suitable for running that service. It could arrange with a commercial vendor to run the service on a machine named, perhaps, service-bureau.com and then ask the manager of the Institute’s name server to bind the indirect name volcano.iss.edu to service-bureau.com. The Institute could then advertise its service under the indirect name. If the commercial vendor raises its prices, it would be possible to move the service to a different vendor by simply rebinding the indirect name.
Because resolving a synonym requires an extra round-trip through DNS, and the basic name-to-Internet-address binding of DNS already provides a level of indirection, some network specialists recommend just manipulating name-to-Internet-address bindings to get the effect of synonyms.
Name discovery comes up in at least three places in the Domain Name System: a client must discover the name of a nearby name server, a user must discover the domain name of a desired service, and the resolving system must discover an extension for unqualified domain names.
First, in order for DNS_RESOLVE to send a request to a name server, it needs to know the Internet address of that name server. DNS_RESOLVE finds this address in a configuration table. The real name-discovery question is how this address gets into the configuration table. In principle, this address would be the address of a root server, but as we have seen it can be the address of any existing name server. The most widely used approach is that when a computer first connects to a network it performs a name discovery broadcast to which the Internet service provider (ISP) responds by assigning the attacher an Internet address and also telling the attacher the Internet address of one or more name servers operated by or for the ISP. Another way to terminate name discovery is by direct communication with a local network manager, to obtain the address of a suitable name server, followed by configuring the answer into DNS_RESOLVE.
The second form of name discovery involves domain names themselves. If you wish to use the volcano prediction service at the Institute of Scholarly Studies, you need to know its name. Some chain of events that began with direct communication must occur. Typically, people learn of domain names via other network services, such as by e-mail, querying a search engine, reading postings in newsgroups, or while surfing the Web, so the original direct communication may be long forgotten. But using each of those services requires knowing a domain name, so there must have been a direct communication at some earlier time. The purchaser of a personal computer is likely to find that it comes with a Web browser that has been preconfigured with domain names of the manufacturer’s suggested World Wide Web query and directory services (as well as domain names of the manufacturer’s support sites and other advertisers). Similarly, a new customer of an Internet service provider is typically told, upon registering for service, the domain name of that ISP’s Web site, which can then be used to discover names for many other services.
The third instance of name discovery concerns the extension that is used for unqualified domain names. Recall that the Domain Name System uses absolute path names, so if DNS_RESOLVE is presented with an unqualified name such as library it must somehow extend it, for example, to library.pedantic.edu. The default context used for extension is usually a configuration parameter of DNS_RESOLVE. The value of this parameter is typically chosen by the human user when initially setting up a computer, with an eye to minimizing typing for the most frequently used domain names.
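A minimal sketch of this extension step, under the simplifying assumption of a single configured default context (a real resolver may try a list of search domains in turn):

    DEFAULT_CONTEXT = "pedantic.edu"   # configuration parameter of DNS_RESOLVE

    def qualify(name):
        # Treat a name that contains a dot as already fully qualified;
        # extend an unqualified name with the default context.
        if "." in name:
            return name
        return name + "." + DEFAULT_CONTEXT

    print(qualify("library"))                # -> library.pedantic.edu
    print(qualify("www.cse.pedantic.edu"))   # already qualified; unchanged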
A shortcoming of DNS is that, although it purports to provide authoritative name resolutions in its responses, it does not use protocols that allow authentication of those responses. As a result, it is possible (and, unfortunately, relatively easy) for an intruder to masquerade as a DNS server and send out mischievous or malevolent responses to name resolution requests.
Currently, the primary way of dealing with this problem is for the user of DNS to treat all of its responses as potentially unreliable hints and independently verify (using the terminology of Chapters 7 [on-line] and 11 [on-line] we would say “perform end-to-end authentication of”) the identity of any system with which that user communicates. An alternative would be for DNS servers to use authentication protocols in communication with their clients. However, even if a DNS response is assuredly authentic, it still might not be accurate (for example, a DNS cache may hold out-of-date information, or a DNS administrator may have configured an incorrect name-to-address binding), so a careful user would still want to independently authenticate the identity of its correspondents.
Chapter 11 [on-line] describes protocols that can be used for authentication; there is an ongoing debate among network experts as to whether or how DNS should be upgraded to use such protocols.
The reader interested in learning more about DNS should explore the documents in the readings for DNS [Suggestions for Further Reading 4.3].
The Network File System (NFS), designed by Sun Microsystems, Inc. in the 1980s, is a client/service application that provides shared file storage for clients across a network. An NFS client grafts a remote file system onto the client’s local file system name space and makes it behave like a local UNIX file system (see Section 2.5). Multiple clients can mount the same remote file system so that users can share files.
The need for NFS arose because of technology improvements. Before the 1980s, computers were so expensive that each one had to be shared among multiple users and each computer had a single file system. But a benefit of the economic pressure was that it allowed for easy collaboration because users could share files easily. In the early 1980s, it became economically feasible to build workstations, which allowed each engineer to have a private computer. But users still desired to have a shared file system for ease of collaboration. NFS provides exactly that: it allows a user at any workstation to use files stored on a shared server, a powerful workstation with local disks but often without a graphical display.
NFS also simplifies the management of a collection of workstations. Without NFS, a system administrator must manage each workstation and, for example, arrange for backups of each workstation’s local disk. NFS allows for centralized management; for example, a system administrator needs to back up only the server’s disks to archive the file system. In the 1980s, the setup also had a cost benefit: NFS allowed organizations to buy workstations without disks, saving the cost of a disk interface on every workstation and the cost of unused disk space on each workstation.
The designers of NFS had four major goals. NFS should work with existing applications, which means NFS should provide the same semantics as a local UNIX file system. NFS should be easy to deploy, which means its implementation should retrofit into existing UNIX systems. The client should be implementable in other operating systems such as Microsoft’s DOS, so that a user on a personal computer can have access to the files on an NFS server; this goal implies that the client/service messages cannot be too specific to the UNIX system. Finally, NFS should be efficient enough to be tolerable to users, but it doesn’t have to provide performance as high as that of a local file system. NFS achieves these goals only partially, because achieving all of them is difficult. The designers made a trade-off: simplify the design and lose some of the UNIX semantics.
This section describes version 2 of NFS. Version 1 was never deployed outside of Sun Microsystems, while version 2 has been in use since 1987. The case study concludes with a brief summary of the changes in versions 3 (1990s) and 4 (early 2000s), which address weaknesses in version 2. Problem set 3 explores an NFS-inspired design to reinforce the ideas in NFS.
To programs, NFS appears as a UNIX file system providing the file interface presented in Section 2.5. User programs can name remote files in the same way as local files. When a user program invokes, say, OPEN (“/users/alice/.profile”, READONLY ), it cannot tell from the path name whether “users” or “alice” are local or remote directories.
To make naming remote files transparent to users and their programs, the NFS client must mount a remote file system on the local name space. NFS performs this operation by using a separate program, called the mounter. This program serves a similar function as the MOUNT call (described in Section 2.5.10); it grafts the remote file system—named by host:path, where host is a DNS name and path a path name—onto the local file name space. The mounter sends a remote procedure call to the file server host and asks for a file handle, a 32-byte name for the object identified by path. On receiving the reply, the NFS client marks the mount point in the local file system as a remote file system. It also remembers the file handle for path and the network address for the server.
To the NFS client a file handle is a 32-byte opaque name that identifies an object on a remote NFS server. An NFS client obtains file handles from the server when it mounts a remote file system or when it looks up a file in a directory on the NFS server. In all subsequent remote procedure calls to the NFS server for that file, the NFS client includes the file handle. In many ways the file handle is similar to an inode number; it is not visible to applications, but it is used as a name internal to NFS for naming files.
To the NFS server a file handle is a structured name—containing a file system identifier, an inode number, and a generation number—which the server can use to locate the file. The file system identifier allows the server to identify the file system responsible for the file. The inode number (see page 58) allows the identified file system to locate the file on the disk.
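To make the two views concrete, the following sketch packs such a structured name into an opaque 32-byte string; the exact field layout here is invented for illustration, since real implementations choose their own:

    import struct

    HANDLE_FMT = "<IIQ16x"   # fs id, generation, inode number, pad to 32 bytes

    def make_handle(fs_id, inode, generation):
        # The server constructs the handle; clients see only opaque bytes.
        return struct.pack(HANDLE_FMT, fs_id, generation, inode)

    def parse_handle(fh):
        # Only the server ever looks inside a handle.
        fs_id, generation, inode = struct.unpack(HANDLE_FMT, fh)
        return fs_id, inode, generation

    fh = make_handle(fs_id=7, inode=1234, generation=1)
    assert len(fh) == 32
    assert parse_handle(fh) == (7, 1234, 1)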
One might wonder why the NFS designers didn’t choose to put path names in file handles. To see why, consider the following scenario with two user programs running on different clients:
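The two programs are reconstructed below from the line references in the discussion that follows; the exact calls are an assumption, and the line numbers give the global order of execution:

Program 1 (client 1)                  Program 2 (client 2)
1  fd ← OPEN (“dir1/f”, READONLY)
3                                     RENAME (“dir1”, “dir2”)
4                                     RENAME (“dir3”, “dir1”)
5  READ (fd, buf, n)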
RENAME (source, destination) changes the name of source to destination. The first rename operation (on line 3) in program 2 renames “dir1” to “dir2”, and the second one (on line 4) renames “dir3” to “dir1”. This scenario raises the following question: when program 1 invokes READ (line 5) after the two rename operations have completed, does program 1 read data from “dir1/f”, or “dir2/f”?
If the two programs were running on the same computer and sharing a local UNIX file system, program 1 would read “dir2/f”, according to the UNIX specification. The goal is that NFS should provide the same behavior. If the NFS server were to put path names inside handles, then the READ call would result in a remote procedure call for the file “dir1/f”. By putting the inode number in the handle the specification is met.
The file handle includes a generation number to handle scenarios such as the following almost correctly:
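Again, the two programs are reconstructed from the line references in the discussion that follows (the exact calls are an assumption; what matters is the delete at line 2 and the re-creation at line 3):

Program on client 2                   Program on client 1
1  fd ← OPEN (“f”, READONLY)
2                                     UNLINK (“f”)
3                                     fd2 ← OPEN (“f”, CREATE)
4  READ (fd, buf, n)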
A program on client 1 deletes a file “f” (line 2) and creates a new file with the same name (line 3), while another program on client 2 has already opened the original file (line 1). If the two programs were running on the same computer and sharing a local UNIX file system, program 2 would read the old file on line 4.
If the server should happen to reuse the inode of the old file for the new file, remote procedure calls from client 2 would get the new file, the one created by client 1, instead of the old file. The generation number allows NFS to avoid this incorrect behavior. When the server reuses an inode, it increases the generation number by one. In the example, client 1 and client 2 would receive different file handles, and client 2 would still be using the old handle. Increasing the generation number makes it always safe for the NFS server to recycle inodes immediately.
For this scenario, NFS does not provide identical semantics to a local UNIX file system because that would require that the server know which files are in use. With NFS, when client 2 uses the file handle, it will receive an error message: “stale file handle”. This case is one example of the NFS designers trading some UNIX semantics to obtain a simpler implementation.
File handles are usable across server failures, so that even if the server computer fails and restarts between a client program opening a file and then reading from the file, the server can identify the file using the information in the file handle. Making file handles (which include a file system identifier and a generation number) usable across server failures requires small changes to the server’s on-disk file system: the NFS designers modified the super block to record the file system identifier and modified inodes to record the generation number for the inode. With this information recorded, after a reboot the NFS server will be able to process NFS requests that the server handed out before it failed.
Table 4.1 shows the remote procedure calls used by NFS. The remote procedure calls are best explained by example. Suppose we have the following fragment of a user program:
Table 4.1. NFS Remote Procedure Calls
| Remote procedure call | Returns |
| NULL () | Do nothing. |
| LOOKUP (dirfh, name) | fh and file attributes |
| CREATE (dirfh, name, attributes) | fh and file attributes |
| REMOVE (dirfh, name) | status |
| GETATTR (fh) | file attributes |
| SETATTR (fh, attributes) | file attributes |
| READ (fh, offset, count) | file attributes and data |
| WRITE (fh, offset, count, data) | file attributes |
| RENAME (dirfh, name, tofh, toname) | status |
| LINK (dirfh, name, tofh, toname) | status |
| SYMLINK (dirfh, name, string) | status |
| READLINK (fh) | string |
| MKDIR (dirfh, name, attributes) | fh and file attributes |
| RMDIR (dirfh, name) | status |
| READDIR (dirfh, offset, count) | directory entries |
| STATFS (fh) | file system information |
fd ← OPEN (“f”, READONLY)
READ (fd, buf, n)
CLOSE (fd)
Figure 4.11 shows the corresponding timing diagram where “f” is a remote file. The NFS client implements each file system operation using one or more remote procedure calls.
Figure 4.11 Example interaction between an NFS client and service. Since the NFS service is stateless, the client does not need to inform the service when the application calls CLOSE. Instead, it just deallocates fd and returns.
In response to the program’s call to OPEN, the NFS client sends the following remote procedure call to the server:
LOOKUP (dirfh, “f”)
From before the program runs, the client has a file handle for the current working directory (dirfh). It obtained this handle as a result of an earlier lookup or as a result of mounting the remote file system.
On receiving the LOOKUP request, the NFS server extracts the file system identifier and inode number from dirfh, and asks the identified file system to look up the inode number in dirfh. The identified file system uses the inode number in dirfh to locate the directory’s inode. Now the NFS server searches the directory identified by the inode number for “f”. If present, the server creates a handle for “f”. The handle contains the file system identifier of the local file system, the inode number for “f”, and the generation number stored in the inode of “f”. The NFS server sends this file handle to the client.
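On the server, those steps reduce to roughly the following sketch, in which dictionaries stand in for the on-disk structures and, for readability, a handle is shown as a (file system identifier, inode number, generation number) tuple rather than as 32 opaque bytes; all names are illustrative:

    # fs id 7 holds a directory (inode 2) containing one entry, "f" (inode 99).
    file_systems = {
        7: {2:  {"kind": "dir",  "entries": {"f": 99}, "generation": 1},
            99: {"kind": "file", "generation": 4}},
    }

    def lookup_rpc(dirfh, name):
        fs_id, dir_inum, _gen = dirfh                # server looks inside the handle
        inodes = file_systems[fs_id]
        inum = inodes[dir_inum]["entries"][name]     # search the directory for name
        return (fs_id, inum, inodes[inum]["generation"])   # new handle for "f"

    print(lookup_rpc((7, 2, 1), "f"))   # -> (7, 99, 4)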
On receiving the response, the client allocates the first unused entry in the program’s file descriptor table, stores a reference to f’s file handle in that entry, and returns the index for the entry (fd) to the user program.
Next, the program calls READ (fd, buf, n). The client sends the following remote procedure call to the NFS server:
READ (fh, 0, n)
As with the directory file handle, the NFS server looks up the inode for fh. Then, the server reads the data and sends the data in a reply message to the client.
When the program calls CLOSE to tell the local file system that it is done with the file descriptor fd, NFS doesn’t issue a CLOSE remote procedure call; the protocol doesn’t have a CLOSE remote procedure call. Because the application didn’t modify the file, the NFS client doesn’t have to issue any remote procedure calls. As we shall see in Section 4.5.4, if a program modifies a file, the NFS client will issue remote procedure calls on a CLOSE system call to provide coherence for the file.
The NFS remote procedure calls are designed so that the server can be stateless, that is, the server doesn’t need to maintain any state other than the on-disk files. NFS achieves this property by making each remote procedure call contain all the information necessary to carry out that request. The server does not maintain any state about past remote procedure calls to process a new request. For example, the client, not the server, must keep track of the file cursor (see Section 2.3.2), and the client includes it as an argument in the READ remote procedure call. As another example, the file handle contains all the information needed to find the inode on the server, as explained above.
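The division of labor shows up clearly in a toy client and server (the names are invented and the code is only a sketch): the cursor lives entirely in the client, and every request carries the handle, offset, and count, so the server needs no memory of earlier calls:

    class ToyNFSServer:
        def __init__(self, files):
            self.files = files                 # file handle -> bytes ("the disk")
        def read_rpc(self, fh, offset, count):
            # Everything needed arrives in the request itself.
            return self.files[fh][offset:offset + count]

    class ToyNFSClient:
        def __init__(self, server):
            self.server = server
            self.cursors = {}                  # fd -> (file handle, offset)
        def read(self, fd, count):
            fh, offset = self.cursors[fd]
            data = self.server.read_rpc(fh, offset, count)
            self.cursors[fd] = (fh, offset + len(data))   # client advances cursor
            return data

    server = ToyNFSServer({b"fh-1": b"hello, nfs"})
    client = ToyNFSClient(server)
    client.cursors[3] = (b"fh-1", 0)           # as if OPEN had returned fd 3
    print(client.read(3, 5), client.read(3, 5))   # b'hello' b', nfs'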
This stateless property simplifies recovery from server failures: a client can just repeat a request until it receives a reply. In fact, the client cannot tell the difference between a server that failed and recovered, and a server that is slow. Because a client repeats a request until it receives a response, it can happen that the server executes a request twice. That is, NFS implements at-least-once semantics for remote procedure calls.
Since many requests are idempotent (e.g., LOOKUP, READ, etc.), that is not a problem, but for some requests it results in surprising behavior. Consider a user program that calls UNLINK on an existing file that is stored on a remote file system. The NFS client would send a REMOVE remote procedure call and the server would execute it, but it could happen that the network lost the reply. In that case, the client would resend the REMOVE request, the server would execute the request again, and the user program would receive an error saying that the file didn’t exist!
Later implementations of NFS minimize this surprising behavior by avoiding executing remote procedure calls more than once when there are no server failures. In these implementations, each remote procedure call is tagged with a transaction number and the server maintains some “soft” state (it is lost if the server fails), namely, a reply cache. The reply cache is indexed by transaction identifier and records the response for the transaction identifier. When the server receives a request, it looks up the transaction identifier (ID) in the reply cache. If the ID is in the cache, the server returns the reply from the cache, without reexecuting the request. If the ID is not in the cache, the server processes the request.
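A reply cache fits in a few lines. The following sketch (with invented names) provides at-most-once execution as long as the server does not fail; because the cache is soft state, a crash empties it:

    reply_cache = {}                  # transaction id -> saved reply (volatile only)

    def handle_request(xid, operation):
        if xid in reply_cache:
            return reply_cache[xid]   # retransmission: resend the old reply
        reply = operation()           # first arrival: execute the request
        reply_cache[xid] = reply
        return reply

    files = {"f"}
    def remove_f():
        if "f" in files:
            files.remove("f")
            return "OK"
        return "file not found"

    print(handle_request(42, remove_f))   # executes the REMOVE: OK
    print(handle_request(42, remove_f))   # retry of the same xid: OK, from cache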
If the server doesn’t fail, a retry of a REMOVE request will receive the same response as the first attempt. If, however, the server fails and restarts between the first attempt and a retry, the request is executed twice. The designers opted to maintain the reply cache as soft state because storing it in non-volatile storage is expensive. Doing so would require that the reply cache be stored, for example, on a disk and would require a disk write for each remote procedure call to record the response. As explained in Section 6.1.8, disk writes are often a performance bottleneck and much more expensive than a remote procedure call.
Although the stateless property of NFS simplifies recovery, it makes it impossible to implement the UNIX file interface correctly because the UNIX specification requires maintaining state. Consider again the case where one program deletes a file that another program has open. The UNIX specification is that the file exists until the second program closes the file.
If the programs run on different clients, NFS cannot adhere to this specification because doing so would require that the server keep state. It would have to maintain a reference count per file, which would be incremented on an OPEN system call and decremented on a CLOSE system call, and persist across server failures. In addition, if a client stopped responding to messages, the server would have to wait until the client became reachable again to decrement the reference count. Instead, NFS just does the easy but slightly wrong thing: remote procedure calls return the error “stale file handle” if a program on another client deletes a file that the first client has open.
NFS does not implement the UNIX specification faithfully because deviating from it simplifies the design of NFS. NFS preserves most of the UNIX semantics, and only in rarely encountered situations may users see different behavior. In practice, these situations are not a serious problem, and in return NFS gets by with simple recovery.
To implement NFS as an extension of the UNIX file system while minimizing the number of changes required to the UNIX file system, the NFS designers split the file system program by introducing an interface that provides vnodes, virtual nodes (see Figure 4.12). A vnode is a structure in volatile memory that abstracts whether a file or directory is implemented by a local file system or a remote file system. This design allows many functions in the file system call layer to be implemented in terms of vnodes, without having to worry about whether a file or directory is local or remote. The interface has an additional advantage: a computer can easily support several different local file systems.
Figure 4.12 NFS implementation for the UNIX system
When a file system call must perform an operation on a file (e.g., reading data), it invokes the corresponding procedure through the vnode interface. The vnode interface has procedures for looking up a file name in the contents of a directory vnode, reading from a vnode, writing to a vnode, closing a vnode, and so on. The local file system and NFS support their own implementation of these procedures.
By using the vnode interface, most of the code for file descriptor tables, current directory, name lookup, and the like, can be moved from the local file system module into the file system call layer with minimal effort. For example, with a few changes, the procedure PATHNAME_TO_INODE from Section 2.5 can be modified to be PATHNAME_TO_VNODE and be provided by the file system call layer.
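The shape of the interface can be suggested with a small class hierarchy; the operations shown are a much-reduced, illustrative subset of a real vnode interface:

    class Vnode:
        """What the file system call layer sees, local or remote."""
        def lookup(self, name): raise NotImplementedError
        def read(self, offset, count): raise NotImplementedError

    class LocalVnode(Vnode):
        def __init__(self, inode):
            self.inode = inode
        def lookup(self, name):
            ...   # consult the local on-disk directory (elided in this sketch)
        def read(self, offset, count):
            ...   # read from the local disk (elided in this sketch)

    class NFSVnode(Vnode):
        def __init__(self, server, fh):
            self.server, self.fh = server, fh
        def lookup(self, name):
            fh = self.server.lookup_rpc(self.fh, name)   # remote procedure call
            return NFSVnode(self.server, fh)
        def read(self, offset, count):
            return self.server.read_rpc(self.fh, offset, count)

    def pathname_to_vnode(root, path):
        # The call layer resolves each component through the vnode interface
        # without asking whether the directory is local or remote.
        vn = root
        for component in path.split("/"):
            if component:
                vn = vn.lookup(component)
        return vn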
To illustrate the vnode design, we consider a user program that invokes OPEN for a file (see Figure 4.12). To open the file, the file system call layer invokes PATHNAME_TO_VNODE, passing the vnode for the current working directory and the path name for the file as arguments. PATHNAME_TO_VNODE will parse the path name, invoking LOOKUP in the vnode layer for each component in the path name. If the directory is a local directory, the vnode-layer LOOKUP invokes the LOOKUP procedure implemented by the local file system to obtain a vnode for the path name component. If the directory is a remote directory, LOOKUP invokes the LOOKUP procedure implemented by the NFS client.
The NFS client invokes the LOOKUP remote procedure call on the NFS server, passing as arguments the file handle of the directory and the path name’s component. On receiving the lookup request, the NFS server extracts the file system identifier and inode number from the file handle for the directory to look up the directory’s vnode and then invokes LOOKUP in the vnode layer, passing the path name’s component as an argument. If the directory is implemented by the server’s local file system, the vnode layer invokes the procedure LOOKUP implemented by the server’s local file system, passing the path name’s component as an argument. The local file system looks up the name and, if present, creates a vnode and returns the vnode to the NFS server. The NFS server sends a reply containing the vnode’s file handle and some metadata for the vnode to the NFS client.
On receiving the reply, the NFS client creates a vnode, which contains the file handle, on the client computer and returns it to the file system call layer on the client machine. When the file system call layer has resolved the complete path name, it returns a file descriptor for the file to the user program.
To achieve usable performance, a typical NFS client maintains various caches. A client stores the vnode for every open file so that the client knows the file handles for open files. A client also caches recently used vnodes, their attributes, recently used blocks of those cached vnodes, and the mapping from path name to vnode. Caching reduces the latency of file system operations on remote files because for cached files a client can avoid the cost of remote procedure calls. In addition, because clients make fewer remote procedure calls, a single server can support more clients. If multiple clients cache the same file, however, NFS must ensure read/write coherence in some way.
When programs share a local file in a UNIX system, the program calling READ observes the data from the most recent WRITE, even if this WRITE was performed by another program. This property is called read/write coherence (see Section 2.1.1.1). If the programs are running on different clients, caching complicates implementing these semantics correctly.
To illustrate the problem, consider a user program on one computer that writes a block of a file. The file system call layer on that computer might perform the update to the block in the cache, delaying the write to the server, just like the local UNIX file system delays a write to disk. If a program on another computer then reads the file from the server, it may not observe the change made on the first computer because that change may not have been propagated to the server yet. Because this behavior would be incorrect, NFS implements a form of read/write coherence.
NFS could guarantee read/write coherence for every operation, or just for certain operations. One option is to provide read/write coherence for only OPEN and CLOSE. That is, if an application OPENs a file, WRITEs, and CLOSEs the file on one client, and if later an application on a second client opens the same file, then the second application will observe the results of the writes by the first application. This option is called close-to-open consistency. Another option is to provide read/write coherence for every read and write. That is, if two applications on different clients have the same file open concurrently, then a READ of one observes the results of WRITEs of the other.
Many NFS implementations provide close-to-open consistency because it allows for higher data rates when reading or writing a big file; a client can send several read or write requests without having to wait for a response after each request. Figure 4.13 illustrates close-to-open semantics in more detail. If, as in case 1, a program on one client calls WRITE and then CLOSE, and then another client calls OPEN and READ, the NFS implementation will ensure that the READ will include the results of the WRITEs by the first client. But, as in case 2, if two clients have the same file open, one client writes a block of the file, and then the other client invokes READ, READ may return the data either from before or after the last WRITE; the NFS implementation makes no guarantees in that case.
Figure 4.13 Two cases illustrating close-to-open consistency
NFS implementations provide close-to-open semantics as follows. The client stores with each data block in its cache the modification time of the block’s vnode at the time the client read the block from the server. When a user program opens a file, the client sends a GETATTR request to fetch the last modification time of the file. The client uses a cached data block only if the block’s recorded modification time is the same as its vnode’s modification time. If the modification times are not the same, the client removes the data block from its cache and fetches it from the server.
The client implements WRITE by modifying its local cached version, without incurring the overhead of remote procedure calls. Then, in the CLOSE call of Figure 4.11, the client, rather than simply returning, would first send any cached writes to the server and wait for an acknowledgment. This implementation is simple and provides decent performance. The client can perform READs and WRITEs at local memory speeds. By delaying sending the modified blocks until CLOSE, the client absorbs modifications that are overwritten (e.g., the application writes the same block multiple times) and aggregates WRITEs to the same block (e.g., WRITEs that modify different parts of the block).
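Putting the two halves together, a client’s per-file cache logic might look like the following sketch, in which the server RPC helpers (getattr_rpc, read_rpc, and write_rpc) are assumed stubs:

    class CachedFile:
        def __init__(self, server, fh):
            self.server, self.fh = server, fh
            self.blocks = {}      # block number -> (data, mtime when fetched)
            self.dirty = {}       # block number -> locally written data

        def open(self):
            # One GETATTR per OPEN: fetch the file's last-modification time.
            self.mtime = self.server.getattr_rpc(self.fh)

        def read_block(self, n):
            if n in self.dirty:
                return self.dirty[n]
            cached = self.blocks.get(n)
            if cached and cached[1] == self.mtime:
                return cached[0]                  # cache is still valid
            data = self.server.read_rpc(self.fh, n)
            self.blocks[n] = (data, self.mtime)
            return data

        def write_block(self, n, data):
            self.dirty[n] = data                  # local memory speed, no RPC

        def close(self):
            # Push all delayed writes and wait, so that a later OPEN on
            # another client observes them.
            for n, data in self.dirty.items():
                self.server.write_rpc(self.fh, n, data)
            self.dirty.clear()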
By providing close-to-open semantics, most user programs written for a local UNIX file system will work correctly when their files are stored on NFS. For example, if a user edits a program on a personal workstation but prefers to compile on a faster compute machine, then NFS with close-to-open consistency works well, requiring no modifications to the editor and the compiler. After the editor has written out the modified file and the user starts the compiler on the compute machine, the compiler will observe the edits.
On the other hand, certain programs will not work correctly using NFS implementations that provide close-to-open consistency. For example, a multiclient database program that reads and writes records stored in a file over NFS will not work correctly because, as the second case in Figure 4.13 illustrates, close-to-open semantics doesn’t specify the semantics when clients execute operations concurrently—for example, if client 2 opens the database file before client 1 closes it and client 3 opens the database file after client 1 closes it. If client 2 and 3 then read data from the file, client 2 may not see the data written by client 1, while client 3 will see the data written by client 1.
Furthermore, because NFS caches blocks (instead of whole files), the file may have blocks from different versions of the file intermixed. When a client fetches a file, it fetches only the inode and perhaps prefetches a few blocks. Subsequent READ RPCs may fetch blocks from a newer version of the file because another client may have written those blocks after this client opened the file.
To provide the correct semantics in this case requires more sophisticated machinery, which NFS implementations don’t provide, because databases often have their own special-purpose solutions anyway, as we discuss in Chapters 9 [on-line] and 10 [on-line]. If the database program doesn’t provide a special-purpose solution, then tough luck—one cannot run it over NFS.
NFS version 2 is being replaced by NFS version 3. Version 3 addresses a number of limitations in version 2, but the extensions do not significantly change the preceding description. For example, version 3 supports 64-bit numbers for recording file sizes and adds an asynchronous write (i.e., the server may acknowledge an asynchronous WRITE request as soon as it receives the request, before it has written the data to disk).
NFS version 4, which took a number of lessons from the Andrew File System [Suggestions for Further Reading 4.2.3], is a bigger change than version 3; in version 4 the server maintains some state. Version 4 also protects against intruders who can snoop and modify network traffic using techniques discussed in Chapter 11 [on-line]. Furthermore, it provides a more efficient scheme for providing close-to-open consistency, and it works well across the Internet, where the client and server may be connected using low-speed links.
The following references provide more details on NFS:
1. Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. “Design and implementation of the Sun network file system”, Proceedings of the 1985 Summer Usenix Technical Conference, June 1985, El Cerrito, CA, pages 119–130.
2. Chet Juszczak, “Improving the performance and correctness of an NFS server”, Proceedings of the 1989 Winter Usenix Technical Conference, January 1989, Berkeley, CA, pages 53–63.
3. Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diana Lebel, and David Hitz, “NFS Version 3 design and implementation”, Proceedings of the 1994 Summer Usenix Technical Conference, June 1994, Boston, MA.
4. Brian Pawlowski, Spencer Shepler, Carl Beame, Brent Callaghan, Michael Eisler, David Noveck, David Robinson, and Robert Thurlow, “The NFS Version 4 protocol”, Proceedings of the Second International SANE Conference, May 2000, Maastricht, The Netherlands.
4.1 When modularity between a client and a service is enforced, there is no way for errors in the implementation of the service to propagate to its clients. True or False? Explain.
1995–1–1d
4.2 Chapter 1 discussed four general methods for coping with complexity: modularity, abstraction, hierarchy, and layering.
4.2a Which of those four methods does client/service use as its primary organizing scheme?
4.3 To client software, a notable difference between remote procedure call and ordinary local procedure call is:
A. None. That’s the whole point of RPC!
B. There may be multiple returns from one RPC call.
C. There may be multiple calls for one RPC return.
D. Recursion doesn’t work in RPC.
E. The runtime system may report a new type of error as a result of an RPC.
4.4 Which of the following statements is true of the X Window System (see Sidebar 4.4)?
A. The X server is a trusted intermediary and attempts to enforce modularity between X clients in their use of the display resource.
B. An X client always waits for a response to a request before sending the next request to the X server.
C. When a program running on another computer displays its window on your local workstation, that remote computer is considered an X server.
4.5 While browsing the Web, you click on a link that identifies an Internet host named www.cslab.scholarly.edu. Your browser asks your Domain Name System (DNS) name server, M, to find an Internet address for this domain name. Under what conditions is each of the following statements true of the name resolution process?
A. To answer your query, M must contact one of the root name servers.
B. If M answered a query for www.cslab.scholarly.edu in the past, then it can answer your query without asking any other name server.
C. M must contact one of the name servers for cslab.scholarly.edu to resolve the domain name.
D. If M has the current Internet address of a working name server for scholarly.edu cached, then that name server will be able to directly provide an answer.
E. If M has the current Internet address of a working name server for cslab.scholarly.edu cached, then that name server will be able to directly provide an answer.
4.6 For the same situation as in Exercise 4.5, which of the following is always true of the name resolution process, assuming that all name servers are configured correctly and no messages are lost?
A. If M had answered a query for the IP address corresponding to www.cslab.scholarly.edu at some time in the past, then it can respond to the current query without contacting any other name server.
B. If M has a valid IP address of a functioning name server for cslab.scholarly.edu in its cache, then M will get a response from that name server without any other name servers being contacted.
4.7 The Network File System (NFS) described in Section 4.5 allows a client machine to run operations on files that are stored at a remote server. For the version of NFS described there, decide if each of these assertions is true or false:
A. When the server responds to a client’s WRITE call, all modifications required by that WRITE will have made it to the server’s disk.
B. An NFS client might send multiple requests for the same operation to the NFS server.
C. When an NFS server crashes, after the operating system restarts and recovers the disk contents, the server must also run its own recovery procedure to make its state consistent with that of its clients.
4.8 Assume that an NFS (described in Section 4.5) server contains a file /a/b and that an NFS client mounts the NFS server’s root directory in the location /x, so that the client can now name the file as /x/a/b. Further assume that this is the only client and that the client executes the following two commands:
The REMOVE message from the client to the server gets through, and the server removes the file. Unfortunately, the response from the server to the client is lost and the client resends the message to remove the (now non-existent) file. The server receives the resent message. What happens next depends on the server implementation. Which of the following are correct statements?
A. If the server maintains an in-memory reply cache in which it records all operations it previously executed, and there are no server failures, the server will return “OK”.
B. If the server maintains an in-memory reply cache but the server has failed, restarted, and its reply cache is empty, both of the following responses are possible: the server may return “file not found” or “OK”.
C. If the server is stateless, it will return “file not found”.
D. Because REMOVE is an idempotent operation, any server implementation will return “OK”.
2006–2–2
Additional exercises relating to Chapter 4 can be found in the problem sets beginning on page 425.
5.1 Client/Server Organization within a Computer Using Virtualization
5.1.1 Abstractions for Virtualizing Computers
5.1.2 Emulation and Virtual Machines
5.1.3 Roadmap: Step-by-Step Virtualization
5.2 Virtual Links Using SEND, RECEIVE, and a Bounded Buffer
5.2.1 An Interface for SEND and RECEIVE with Bounded Buffers
5.2.2 Sequence Coordination with a Bounded Buffer
5.2.3 Race Conditions
5.2.4 Locks and Before-or-After Actions
5.2.5 Deadlock
5.2.6 Implementing ACQUIRE and RELEASE
5.2.7 Implementing a Before-or-After Action Using the One-Writer Principle
5.2.8 Coordination between Synchronous Islands with Asynchronous Connections
5.3 Enforcing Modularity in Memory
5.3.1 Enforcing Modularity with Domains
5.3.2 Controlled Sharing Using Several Domains
5.3.3 More Enforced Modularity with Kernel and User Mode
5.3.4 Gates and Changing Modes
5.3.5 Enforcing Modularity for Bounded Buffers
5.3.6 The Kernel
5.4 Virtualizing Memory
5.4.1 Virtualizing Addresses
5.4.2 Translating Addresses Using a Page Map
5.4.3 Virtual Address Spaces
5.4.4 Hardware versus Software and the Translation Look-Aside Buffer
5.4.5 Segments (Advanced Topic)
5.5 Virtualizing Processors Using Threads
5.5.1 Sharing a Processor Among Multiple Threads
5.5.2 Implementing YIELD
5.5.3 Creating and Terminating Threads
5.5.4 Enforcing Modularity with Threads: Preemptive Scheduling
5.5.5 Enforcing Modularity with Threads and Address Spaces
5.5.6 Layering Threads
5.6 Thread Primitives for Sequence Coordination
5.6.1 The Lost Notification Problem
5.6.2 Avoiding the Lost Notification Problem with Eventcounts and Sequencers
5.6.3 Implementing AWAIT, ADVANCE, TICKET, and READ (Advanced Topic)
5.6.4 Polling, Interrupts, and Sequence Coordination
5.7 Case Study: Evolution of Enforced Modularity in the Intel x86
5.7.1 The Early Designs: No Support for Enforced Modularity
5.7.2 Enforcing Modularity Using Segmentation
5.7.3 Page-Based Virtual Address Spaces
5.7.4 Summary: More Evolution
5.8 Application: Enforcing Modularity Using Virtual Machines
5.8.1 Virtual Machine Uses
5.8.2 Implementing Virtual Machines
5.8.3 Virtualizing Example
The goal of the client/service organization is to limit the interactions between clients and services to messages. To ensure that there are no opportunities for hidden interactions, the previous chapter assumed that each client module and service module runs on a separate computer. Under that assumption, the network between the computers enforces modularity. This implementation reduces the opportunity for programming errors to propagate from one module to another, but it is also good for achieving security (because the service module can be penetrated only by sending messages) and fault tolerance (service modules can be separated geographically, which reduces the risk that a catastrophic failure such as an earthquake or a massive power failure affects all servers that implement the service).
The main disadvantage of using one computer per module is that it requires as many computers as modules. Since the modularity of a system and its applications shouldn’t be dictated by the number of computers available, this requirement is undesirable. If the designer decides to split a system or application into n modules and would like to enforce modularity between them, the choice of n should not be constrained by the number of computers that happen to be in stock and easily obtained. Instead, the designer needs a way to run several modules on the same computer without resorting to soft modularity.
This chapter introduces virtualization as the primary approach to achieve this goal and presents three new abstractions (SEND and RECEIVE with bounded buffers, virtual memory, and threads) that correspond to virtualized versions of the three main abstractions (communication links, memory, and processors). The three new abstractions allow a designer to implement as many virtual computers as needed for running the desired n modules.
To enforce modularity between modules running on the same computer, we create several virtual computers using one physical computer and execute each module (usually an application or a subsystem) in its own virtual computer.
This idea can be realized using a technique called virtualization. A program that virtualizes a physical object simulates the interface of the physical object, but it creates many virtual objects by multiplexing one physical instance, or it may provide one large virtual object by aggregating many physical instances, or implement a virtual object from a different kind of physical object using emulation. For the user of the simulated object, it provides the same behavior as a physical instance, but it isn’t the physical instance, which is why it is called virtual. A primary goal of virtualization is to preserve an existing interface. That way, modules designed to use a physical instance of an object don’t have to be modified to use a virtual instance. Figure 5.1 gives some examples of the three virtualization methods, which we will discuss in turn.
Figure 5.1 Examples of virtualization.
Hosting several Web sites on a single physical server is an example of virtualization involving multiplexing. If the aggregate peak load of the Web sites is less than what a single server computer can support, providers often prefer to use a single server to host several Web sites because it is less expensive than buying one server for each Web site.
The next three examples relate to threads and virtual memory, which we will overview in Section 5.1.1. Some of these usages don’t rely on a single method of virtualization but combine several or use different methods to obtain different properties. For example, virtual memory with paging (described in Section 6.2) uses both multiplexing and aggregation.
A virtual circuit virtualizes a wire or communication channel using multiplexing. For example, it allows several phone conversations to take place over a single wire with a technique called time division multiplexing, as we will discuss in Chapter 7 [on-line]. Channel bonding aggregates several communication channels to provide a combined high data rate.
RAID (see Section 2.1.1.4) is an example of virtualization involving aggregation. In RAID, a number of disks are aggregated together in a clever way that provides an identical interface to the one of a single disk, but together the disks provide improved performance (by reading and writing disks concurrently) and durability (by writing information on more than one disk). A system administrator can replace a single disk with a RAID and take advantage of the RAID improvements without having to change the file system.
A RAM disk is an example of virtualization involving emulation. A RAM disk provides the same interface as a physical disk but stores blocks in memory instead of on a disk platter. A RAM disk can therefore read and write blocks much faster than a physical disk but, because RAM is volatile, it provides little durability. Administrators can configure a file system to use a RAM disk instead of a physical disk without needing to modify the file system itself. For example, a system administrator may configure the file system to use a RAM disk to store temporary files, which allows the file system to read and write temporary files fast. And since temporary files don’t have to be stored durably, nothing is lost by storing them on a RAM disk.
A virtual PC is an example of virtualization using emulation. It allows the construction of a virtual personal computer out of a physical personal computer, perhaps of a different type (e.g., using a Macintosh to emulate a virtual PC). Virtual PCs can be useful for running several operating systems on a single computer, or for simplifying the testing and development of a new operating system. Section 5.1.2 discusses this virtualization technique in more detail.
Designers are often tempted to tinker slightly with an interface rather than virtualizing it exactly, to improve it or to add a useful feature. Such tinkering can easily cross the line beyond which the original goal of not having to modify other modules that use the interface is lost. For example, the X Window System described in Sidebar 4.4 implements objects that could be thought of as virtual displays, but because the size of those objects can be changed on the fly and the program that draws on them should be prepared to redraw them on command, it is more appropriate to call them "windows".
Similarly, a file system (see Section 2.3.2) creates objects that store bits and thus has some similarity to a virtualized hard disk, but because files have names, are of adjustable length, allow controlled sharing, and can be organized into hierarchies, they are more appropriately thought of as a different memory abstraction.
The preceding examples suggest how we could implement the client/service organization within a single computer. Consider a computer on which we would like to run five modules: a text editor, an e-mail reader, a keyboard manager, the window service, and the file service. When a user works with the text editor, keyboard input should go to the editor. When the user moves the mouse from the editor window to the mail reader window, the next keyboard input should go to the mail reader. When the text editor saves a file, the file service must execute to store the file. If there are more modules than computers, some solution is needed for sharing a single computer.
The idea is to present each module with its own virtual computer. The power of this idea is that programmers can think of each module independently. From the programmer’s perspective, every program module has a virtual computer to itself, which executes independently of the virtual computers of other modules. This idea enforces modularity because a virtual computer can contain a module’s errors and no module can halt the progress of other modules.
The virtual computer design does not enforce modularity as well as running modules on physically separate computers because, for example, a power failure will knock out all virtual computers on the same physical computer. Also, once an attacker has broken into one virtual computer, the attacker may discover a way to exploit a flaw in the implementation of virtualization to break into other virtual computers. The primary modularity goal of employing virtual computers is to ensure that module failures due to accidental programming errors don’t propagate from one virtual computer to another. Virtual computers can contribute to security goals but are better viewed as only one of several lines of defense.
The main challenge in implementing virtual computers is finding the right abstractions to build them. This chapter introduces three abstractions that correspond to virtualized versions of the main abstractions: SEND and RECEIVE with bounded buffers (virtualizes communication links), virtual memory (virtualizes memory), and threads (virtualizes processors).
These three abstractions are typically implemented by a program that is called the operating system (which was briefly discussed in Sidebar 2.4 but will be discussed in detail in this chapter). Using an operating system that provides the three abstractions, we can implement the client/service organization within a single computer (see Figure 5.2). For example, with this design the text editor running on one virtual computer can send a message over the virtual communication link to the file service, running on a different virtual computer, and ask it to save a file. In the figure each virtual computer has one virtual processor (implemented by a thread) and its own virtual memory with a virtual address space ranging from 0 to 2^n. To build an intuition for these abstractions and learn how they can be used to implement a virtual computer, we give a brief overview of them.
Figure 5.2 An operating system providing the editor and file service module each their own virtual computer. Each virtual computer has a thread that virtualizes the processor. Each virtual computer has a virtual memory that provides each module with the illusion that it has its own memory. To allow communication between virtual computers, the operating system provides SEND, RECEIVE, and a bounded buffer of messages.
The first step in virtualizing a computer is to virtualize the processor. To provide the editor module (shown in Figure 5.3) with a virtual processor, we create a thread of execution, or thread for short. A thread is an abstraction that encapsulates the execution state of an active computation. It encapsulates the state of a conceptual interpreter that executes the computation (see Section 2.1.2). The state of a thread consists of the variables internal to the interpreter (e.g., processor registers), which include
Figure 5.3 Sketch of the program for the editor module.
1. A reference to the next program step (e.g., a program counter)
2. References to the environment (e.g., a stack, a heap, and other current objects)
The thread abstraction encapsulates enough of the interpreter’s state that one can stop a thread at any point in time, and later resume it. The ability to stop a thread and resume it later allows virtualization of the interpreter and provides a convenient way of multiplexing a physical processor. Threads are the most widely used implementation strategy to virtualize physical processors. In fact, this implementation strategy is so common that in the context of virtualizing physical processors the words “thread” and “virtual processor” have become synonyms in practice.
The next few paragraphs give a high-level overview of how threads can be used to virtualize physical processors. A user might type the name of the module that the user wants to run, or a user might select the name from a pull-down menu. The command line interpreter or the window system can then start the program as follows:
1. Load the program’s text and data stored in the file system into memory.
2. Allocate a thread and start it at a specified address. Allocating a thread involves allocating a stack to allow the thread to make procedure calls, setting the SP register to the top of the stack, and setting the PC register to the starting address.
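A minimal C sketch of what step 2 might look like inside a thread manager (the descriptor layout and the stack size here are illustrative assumptions, not the book's design): allocate a stack, point the saved SP at its top, and point the saved PC at the program's start address, so the thread can later be run or resumed from exactly this state.

```c
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)        /* illustrative per-thread stack size   */

typedef struct {
    void *sp;                         /* saved stack pointer                  */
    void (*pc)(void);                 /* saved program counter                */
    unsigned char *stack;             /* base of the allocated stack          */
} thread;

/* Allocate a thread that will begin execution at start_address. */
thread *allocate_thread(void (*start_address)(void)) {
    thread *t = malloc(sizeof *t);
    t->stack = malloc(STACK_SIZE);
    t->sp = t->stack + STACK_SIZE;    /* SP set to the top (stacks grow down) */
    t->pc = start_address;            /* PC set to the specified address      */
    return t;
}
```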
A module may have one or several threads. A module with only one thread (and thus one processor) is common because then the programmer can think of it as executing a program serially: it starts at the beginning, computes (perhaps producing some output by performing a remote procedure call to a service), and then terminates. This simple structure follows the principle of least astonishment for programmers. Humans are better at understanding serial programs than at understanding programs that have several, concurrent threads, which can have surprising behavior.
Modules may have more than one thread by creating several threads. A module, for example, may create a thread per device that the module manages so that the module can operate the devices concurrently. Or a module may create several threads to overlap the latency of an expensive operation (e.g., waiting for a disk) by running the expensive operation in another thread. A module may allocate several threads to exploit several physical processors that run each thread concurrently. A service module may create several threads to process requests from different clients concurrently.
The thread abstraction is implemented by a thread manager. The thread manager’s job is to multiplex the possibly many threads on the limited number of physical processors of the computer, and in such a way that a programming error in one thread cannot interfere with the execution of another thread. Since the thread encapsulates enough of the state so that one can stop a thread at any point in time, and later resume it, the thread manager can stop a thread and allocate the released physical processor to another thread by resuming that thread. Later the thread manager can resume the suspended thread again by reallocating a physical processor to that thread. In this way, the thread manager can multiplex many threads across a number of physical processors. The thread manager can ensure that no thread hogs a physical processor by forcing each thread to periodically give up its physical processor on a clock interrupt.
With the introduction of threads, it is helpful to refine the description of the interrupt mechanism described in Chapter 2. External events (e.g., a clock interrupt or a magnetic disk signals the completion of an I/O) interrupt a physical processor, but the event may have nothing to do with the thread running on the physical processor. On an interrupt, the processor invokes the interrupt handler and after returning from the handler continues running the thread that was running on the physical processor before the interrupt. If one processor should not be interrupted because it is already busy processing an interrupt, the next interrupt may interrupt another processor in the computer, allowing interrupts to be processed concurrently.
Some interrupts do pertain to the currently running thread. We shall refer to this class of interrupts as exceptions. The exception handler runs in the context of the interrupted thread; it can read and modify the interrupted thread’s state. Exceptions often happen when a thread performs some operation that the hardware cannot complete (e.g., divide by zero). Many programming languages also have a notion of an exception; for example, a square root program may signal an exception if its caller hands it a negative argument. We shall see that because exception handlers run in the context of the interrupted thread, but interrupt handlers run in the context of the operating system, there are different restrictions on what the two kinds of handlers can safely do.
As described so far, all threads and handlers share the same physical memory. Each processor running a thread sends READ and WRITE requests across a bus along with an address identifying the memory location to be read or written. Sharing memory has benefits, but uncontrolled sharing makes it too easy to make a mistake. If several threads have their programs and data stored in the same physical memory, then the threads of each module have access to every other module’s data. In fact, a simple programming error (e.g., the program computes the wrong address) can result in a STORE instruction overwriting another module’s data or a JMP instruction executing procedures of another module. Thus, without a memory enforcement mechanism we have, at best, soft modularity. In addition, the physical memory and address space may be too small to fit the applications, requiring the applications to manage the memory carefully.
To enforce modularity, we must ensure that the threads of one module cannot overwrite the data of another module by accident. To do so, we give each module its own virtual memory, as Figure 5.2 illustrates. Virtual memory can provide each module with its own virtual address space, which has its own virtual addresses. That is, the arguments to JMP, LOAD, and STORE instructions are all virtual addresses, which a new hardware gadget (called a virtual memory manager) translates to physical addresses. If each module has its own virtual address space, then a module can name only its own physical memory and cannot store to the memory of another module. If a thread of a module by accident calculates an incorrect virtual address and stores to that virtual address, it will affect only that module.
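To make the translation step concrete, here is a minimal C sketch of the virtual memory manager's job (a single-level map with illustrative sizes; the actual designs appear in Section 5.4): every virtual address in a JMP, LOAD, or STORE is mapped through the issuing module's own table, so a wild address can name only that module's pages.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NPAGES    1024u               /* pages in one module's address space  */

typedef struct {
    uint32_t page_map[NPAGES];        /* virtual page number -> physical page */
} address_space;

/* Translate a virtual address issued by a module into a physical address.   */
/* Returns 0 for an address outside the module's own address space; a real   */
/* manager would signal an exception instead.                                */
uint32_t translate(const address_space *as, uint32_t vaddr) {
    uint32_t vpage  = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;
    if (vpage >= NPAGES) return 0;
    return as->page_map[vpage] * PAGE_SIZE + offset;
}
```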
With threads and virtual memory, we can create a virtual computer for each module. Each module has one or more threads that execute the code of the module. The threads of one module share a single virtual address memory that threads of other modules by default cannot touch.
To allow client and service modules on virtual computers to communicate, we introduce SEND and RECEIVE with a bounded buffer of messages. A thread can invoke SEND, which attempts to insert the supplied message into a bounded buffer of messages. If the bounded buffer is full, the sending thread waits until there is space in the bounded buffer. A thread invokes RECEIVE to retrieve a message from the buffer; if there are no messages, the calling thread waits. Using SEND, RECEIVE, and bounded buffers, we can implement remote procedure calls and enforce strong modularity between modules on different virtual computers running on the same physical computer.
To make the abstractions concrete, this chapter develops a minimal operating system that provides the abstractions (see Table 5.1 for its interface). This minimal design exhibits many of the mechanisms that are found in existing operating systems, but to keep the explanation simple it doesn’t describe any existing system. Existing systems have evolved over many years, incorporating new ideas as they came along. As a result, few existing systems provide an example of a clean, simple design. In addition, a complete operating system includes many services (such as a file system, a window system, etc.) that are not included in the minimal operating system described in this chapter.
Table 5.1. The Interface Developed in this Chapter
| Abstraction | Procedures |
| Memory | CREATE_ADDRESS_SPACE, DELETE_ADDRESS_SPACE, ALLOCATE_BLOCK, FREE_BLOCK, MAP, UNMAP |
| Interpreter | ALLOCATE_THREAD, EXIT_THREAD, DESTROY_THREAD, YIELD, AWAIT, ADVANCE, TICKET, ACQUIRE, RELEASE |
| Communication | ALLOCATE_BOUNDED_BUFFER, DEALLOCATE_BOUNDED_BUFFER, SEND, RECEIVE |
The previous section briefly described three high-level abstractions that virtualize processors, memory, and links to enforce modularity. An alternative approach is to provide an interface that is identical to some physical hardware. In this approach, one can enforce modularity by providing each application with its own instance of the physical hardware.
This approach can be implemented using a technique called emulation. Emulation simulates some physical hardware so faithfully that the emulated hardware can run any software the physical hardware can. For example, Apple Inc. has used emulation successfully to move customers to new hardware designs. Apple used a program named Classic to emulate Motorola Inc.’s 68030 processor on the PowerPC processor and more recently used a program named Rosetta to emulate the PowerPC processor on Intel Inc.’s x86 processor. As another example, some processors include a microcode interpreter inside the processor to simulate instructions of other processors or instructions from older versions of the same processor. It is also standard practice for a vendor developing a new processor to start by writing an emulator for it and running the emulator on some already existing processor. This approach allows software development to begin before the chip for the new processor is manufactured, and when the chip does become available, the emulator acts as a kind of specification against which to debug the chip.
Emulation in software is typically slow because interpreting the instructions of the emulated machine in software has substantial overhead. Looking at the structure of an interpreter in Figure 2.5, it is easy to see that decoding the simulated instruction, performing its operation, and updating the state of the simulated processor can take tens of instructions on the processor that performs the emulation. As a result, emulation in software can cost a factor of 10 in performance, and a designer must work hard to do better.
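The overhead is visible in the shape of an emulator's inner loop. The following C sketch uses a tiny hypothetical instruction set (not any real processor): each simulated instruction costs a fetch, a decode, the operation itself, and a state update on the host processor.

```c
#include <stdint.h>

enum { OP_ADD = 0, OP_JMP = 1, OP_HALT = 2 };   /* tiny hypothetical ISA */

typedef struct {
    uint32_t pc;          /* simulated program counter */
    uint32_t reg[8];      /* simulated registers       */
    uint32_t mem[1024];   /* simulated memory          */
} machine;

void emulate(machine *m) {
    for (;;) {
        uint32_t insn = m->mem[m->pc % 1024];   /* fetch                 */
        uint32_t op = insn >> 24;               /* decode opcode         */
        uint32_t rd = (insn >> 16) & 0x7;       /* decode register fields */
        uint32_t rs = (insn >> 8) & 0x7;
        switch (op) {                           /* perform the operation */
        case OP_ADD:  m->reg[rd] += m->reg[rs]; m->pc += 1;       break;
        case OP_JMP:  m->pc = (insn & 0xFFFF) % 1024;             break;
        case OP_HALT: return;
        default:      m->pc += 1;               /* skip unknown opcode   */
        }                                       /* state is now updated  */
    }
}
```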
A specialized approach to fast emulation is using virtual machines. In this approach, a physical processor is used as much as possible to implement many virtual instances of itself. That is, virtual machines emulate many instances of a machine M using a physical machine M. This approach loses the portability of general emulation but provides better performance. The part of the operating system that provides virtual machines is often called a virtual machine monitor. Section 5.8 of this chapter discusses virtual machines and virtual machine monitors in more detail. Internally, a virtual machine monitor typically uses bounded buffers, virtual memory, and threads, the main topics of this chapter.
This chapter gradually develops the tools needed to provide a virtual computer. We start out assuming that there are more physical processors than threads and that the operating system can allocate each thread its own physical processor. We will even assume that each interrupt handler has its own physical processor, so when an interrupt occurs, the handler runs on that dedicated processor. Figure 5.4 shows a modified version of Figure 2.2, in which each thread has its own processor. Consider again the example that we would like to run the following five modules on a single computer: a text editor, an e-mail reader, a keyboard manager, the window service, and the file service. Processor 1, for example, might run the text editor thread. Processor 2 might run the e-mail reader thread. The window manager might have one thread per window, each running on a separate processor. Similarly, the file service might have several threads, each running on a separate processor. The LOAD and STORE instructions of threads refer to addresses that name memory locations or registers of the various controllers. That is, threads share the memories and controllers.
Figure 5.4 A computer with several hardware modules connected by a shared bus. Each thread of the software modules has its own processor allocated to it.
Given this setup, Section 5.2 shows how the client/server organization can be implemented in a computer with many processors and a single address space by allowing the threads of different modules to communicate through a bounded buffer. This implementation takes advantage of the fact that processors within a computer can interact with one another through a shared memory. That ability will prove useful to implement virtual communication links, but such unconstrained interaction through shared memory would drastically compromise modularity. For this reason, Section 5.3 will adjust this assumption to show how to provide and enforce walls between the memory regions used by different modules to restrict and control sharing of memory.
Sections 5.4, 5.5, and 5.6 of this chapter remove restrictions on the design presented in Sections 5.2 and 5.3. In Section 5.4 we will remove the restriction that processors must share one single, large address space, and provide each module with its own virtual memory, while still allowing controlled sharing. In Section 5.5 we remove the restriction that each thread must have its own physical processor while still ensuring that no thread can halt the progress of other threads involuntarily. Finally, in Section 5.6 we remove the restriction that a thread must use a physical processor continuously to test if another thread has sent a message.
The operating system, thread manager, virtual memory manager, and SEND and RECEIVE with bounded buffers presented in this chapter are less complex than the designs found in contemporary computer systems. One reason is that most contemporary designs have evolved over time with changing technologies, while also allowing users to continue to run old programs. As an example of this evolution, Section 5.7 briefly describes the history of the Intel x86 processor, a widely used general-purpose processor design that has, over the years, provided increasing support for enforced modularity.
Operating systems designers have developed many abstractions for virtual communication links. One popular abstraction is pipes [Suggestions for Further Reading 2.2.1 and 2.2.2], which allow two programs to communicate using procedures from the file system call interface. Because SEND and RECEIVE with a bounded buffer mirror a communication link directly, we describe them in more detail in this chapter. The implementation of SEND and RECEIVE with a bounded buffer also mirrors implementations of sockets, an interface for virtual links provided in operating systems such as UNIX and Microsoft Windows.
The main challenge in implementing SEND and RECEIVE with bounded buffers is that several threads, perhaps running in parallel on separate physical processors, may add and remove messages from the same bounded buffer concurrently. To ensure correctness, the implementation must coordinate these updates. This section will present bounded buffers in detail and introduce some techniques to coordinate concurrent actions.
An operating system might provide the following interface for SEND and RECEIVE with bounded buffers:
buffer ← ALLOCATE_BOUNDED_BUFFER (n): allocate a bounded buffer that can hold n messages.
DEALLOCATE_BOUNDED_BUFFER (buffer): free the bounded buffer buffer.
SEND (buffer, message): if there is room in the bounded buffer buffer, insert message in the buffer. If not, stop the calling thread and wait until there is room.
message ← RECEIVE (buffer): if there is a message in the bounded buffer buffer, return the message to the calling thread. If there is no message in the bounded buffer, stop the calling thread and wait until another thread sends a message to buffer buffer.
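Expressed as C-style declarations (a sketch only; the concrete types and pointer-based signatures are assumptions, since the text specifies just the behavior):

```c
typedef struct bounded_buffer bounded_buffer;   /* holds up to n messages      */
typedef struct message message;                 /* contents left unspecified   */

bounded_buffer *ALLOCATE_BOUNDED_BUFFER(int n);
void DEALLOCATE_BOUNDED_BUFFER(bounded_buffer *buffer);
void SEND(bounded_buffer *buffer, message *msg);  /* waits while buffer full   */
message *RECEIVE(bounded_buffer *buffer);         /* waits while buffer empty  */
```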
SEND and RECEIVE with bounded buffers allow sending and receiving messages as described in Chapter 4. By building stubs that use these primitives, we can implement remote procedure calls between threads on the same physical computer in the same way as remote procedure calls between physical computers. That is, from the client’s point of view in Figure 4.8, there is no difference between sending a message to a local virtual computer or to a remote physical computer. In both cases, if the client or service module fails because of a programming error, then the other module needs to provide a recovery strategy, but it doesn’t necessarily fail.
The implementation with bounded buffers requires coordination between sending and receiving threads because a thread may have to wait until buffer space is available or until a message arrives. Two quite different approaches to thread coordination have developed over the years by researchers in different fields. One approach, usually taken by operating system designers, assumes that the programmer is an all-knowing genius who makes no mistakes. The other approach, usually taken by database designers, assumes that the programmer is a mere mortal, so it provides strong automatic support for coordination correctness, but at some cost in flexibility.
The next couple of subsections exhibit the genius approach to coordination, not because it is the best way to tackle coordination problems, but rather to give some intuition about why it requires a coordination genius, and thus should be subcontracted to such a specialist whenever possible. In addition, to implement the database approach the designer of the automatic coordination support approach must use the genius approach. Chapter 9 [on-line] uses the concepts introduced in this chapter to implement the database approach for mere mortals.
The scenario is that we have two threads (a sending thread and a receiving thread) that share a buffer into which the sender puts messages and the receiver removes those messages. For clarity we will assume that the sending and receiving thread each have their own processor allocated to them; that is, for the rest of this section we can equate thread with processor, and thus threads can proceed concurrently at independent rates. As mentioned earlier, Section 5.5 will explore what happens when we eliminate that assumption.
The buffer is bounded, which means that it has a fixed size. To ensure that the buffer doesn’t overflow, the sending thread should hold off putting messages into the buffer when the number of messages there reaches some predefined limit. When that happens, the sender must wait until the receiver has consumed some messages.
The problem of sharing a bounded buffer between two threads is an instance of the producer and consumer problem. For correct operation, the consumer and the producer must coordinate their activities. In our example, the constraint is that the producer must first add a message to the shared buffer before the consumer can remove it and that the producer must wait for the consumer to catch up when the buffer fills up. This kind of coordination is an example of sequence coordination: a coordination constraint among threads stating that, for correctness, an event in one thread must precede an event in another thread.
Figure 5.5 shows an implementation of SEND and RECEIVE using a bounded buffer. This implementation requires making some subtle assumptions, but before diving into these assumptions let’s first consider how the program works. The two threads implement the sequence coordination constraint using N (the size of the shared bounded buffer) and the variables in (the number of items produced) and out (the number of items consumed). If the buffer contains items (i.e., in > out on line 10), then the receiver can proceed to consume the items; otherwise, it loops until the sender has put some items in the buffer. Loops in which a thread is waiting for an event without giving up its processor are called spin loops.
Figure 5.5 An implementation of a SEND and RECEIVE using bounded buffers.
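The figure's code is not reproduced in this text, so the following C sketch reconstructs its algorithm from the surrounding discussion; comments note the figure line numbers that the text cites. Like the figure, it assumes one sender and one receiver on dedicated processors, and it depends on the assumptions discussed below (which plain C does not guarantee: a compiler may cache in and out in registers), so treat it as a sketch rather than portable code.

```c
#define N 20                          /* size of the shared bounded buffer      */

typedef struct { int payload; } message;    /* placeholder message contents     */

typedef struct {
    message buf[N];
    long in;                          /* messages produced (written by sender)  */
    long out;                         /* messages consumed (written by receiver)*/
} bounded_buffer;

void SEND(bounded_buffer *p, message msg) {
    while (p->in - p->out == N)       /* figure line 6: is the buffer full?     */
        ;                             /* spin loop: wait without yielding       */
    p->buf[p->in % N] = msg;          /* figure line 7: insert the message      */
    p->in = p->in + 1;                /* figure line 8: increment in            */
}

message RECEIVE(bounded_buffer *p) {
    while (!(p->in > p->out))         /* figure line 10: any messages yet?      */
        ;                             /* spin until the sender produces one     */
    message msg = p->buf[p->out % N]; /* copy the message out of the buffer     */
    p->out = p->out + 1;              /* increment out                          */
    return msg;
}
```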
To ensure that the sender waits when the buffer is full, the sender puts new items in the buffer only if in − out < N (line 6); otherwise, it spins until the receiver has made room in the buffer by consuming some items. This design ensures that the buffer does not overflow.
The correctness of this implementation relies on several assumptions:
1. The implementation assumes that there is one sending thread and one receiving thread and that only one thread updates each shared variable. In the program only the receiver thread updates out, and only the sender thread updates in. If several threads update the same shared variable (e.g., multiple sending threads update in or the receiving thread and the sending thread update a variable), then the updates to the shared variable must be coordinated, which this implementation doesn’t do.
This assumption exemplifies the principle that coordination is simplest when each shared variable has just one writer:
One-Writer Principle
If each variable has only one writer, coordination becomes easier.
That is, if you can, arrange your program so that two threads don't update the same shared variable. Following this principle also improves modularity because information flows in only one direction: from the single writer to the reader. In our implementation, out contains information that flows from the receiver thread to the sender, and in contains information that flows from the sender thread to the receiver. This restriction of information flow simplifies correctness arguments and, as we will see in Chapter 11 [on-line], can also enhance security.

A similar observation holds for the way the bounded buffer buffer is implemented. Because messages is a fixed-size array, the entries are written only by the sender thread. If the buffer had been implemented as a linked list, we might have a situation in which the sender and the receiver need to update a shared variable at the same time (e.g., the pointer to the head of the linked list) and then these updates would have to be coordinated.
2. The spin loops on lines 6 and 10 require the previously mentioned assumption that the sender and the receiver threads each run on a dedicated processor. When we remove that assumption in Section 5.5 we will have to do something about these spin loops.
3. This implementation assumes that the variables in and out are integers whose representation must be large enough that they will never overflow for the life of the buffer. Integers of width 64 or 96 bits would probably suffice for most applications. (An alternative way to remove this assumption is to make the implementation of the bounded buffer more complicated: perform all additions involving in and out modulo N, and reserve one additional slot in the buffer to distinguish a full buffer from an empty one.)
4. The implementation assumes that the shared memory provides read/write coherence (see Section 2.1.1.1) for in and out. That is, a LOAD of the variable in or out by one thread must be guaranteed to obtain the result of the most recent store to that variable by the other thread.
5. The implementation assumes before-or-after atomicity for the variables in and out. If these two variables fit in a 16- or 32-bit memory cell that can be read and written with a single LOAD or STORE, this assumption is likely to be true. But a 64- or 96-bit integer would probably require multiple memory cells. If they do, reading and writing in and out would require multiple LOADs or STOREs, and additional measures will be necessary to make these multistep sequences atomic.
6. The implementation assumes that the result of executing a statement becomes visible to other threads in program order. If an optimizing compiler or processor reorders statements to achieve better performance, this program could work incorrectly. For example, if the compiler generates code that reads p.in once, holds it in a temporary register for use in lines 6 through 8, and updates the memory copy of p.in immediately, then the receiver may read the contents of the inth entry of the shared buffer before the sender has copied its message into that entry.
The rest of this section explains what problems occur when assumptions 1 (the one-writer principle) and 5 (before-or-after atomicity of multistep LOAD and STORE sequences) don’t hold and introduces techniques to ensure them. In Section 5.5 we will find out how to remove assumption 2 (more processors than threads). Throughout, we assume that assumptions 3, 4, and 6 always hold.
To illustrate the importance of the six assumptions in guaranteeing the correctness of the program in Figure 5.5, let’s remove two of those assumptions, one at a time, to see just what goes wrong. What we will find is that to deal with the removed assumptions we need additional mechanisms, mechanisms that Section 5.2.4 will introduce. This illustration reinforces the observation that concurrent programming needs the attention of specialists: all it takes is one subtle change to make a correct program wrong.
To remove the first assumption, let’s allow several senders and receivers. This change will violate the one-writer principle, so we should not be surprised to find that it introduces errors. Multiple senders and receivers are common in practice. For example, consider a printer that is shared among many clients. The service managing the printer may receive requests from several clients. Each request adds a document to the shared buffer of to-be-printed documents. In this case, we have several senders (the threads adding jobs to the buffer) and one receiver (the printer).
As we will see, the errors that will manifest themselves are difficult to track down because they don’t always show up. They appear only with a particular ordering of the instructions of the threads involved. Thus, concurrent programs are not only difficult to get right, but also difficult to debug when they are wrong.
The solution in Figure 5.5 doesn’t work when several senders execute the code concurrently. To see why, let’s assume N is 20 and that all entries in the buffer are empty (e.g., out is 0 and in is 0), and each thread is running on its own processor:
If two sending threads run concurrently—one on processor A and one on processor B—the threads issue instructions independently of each other, at their own pace. The processors may have different speeds and take interrupts at different times, or instructions may hit in the cache on one processor and miss on another, so there is no way to predict the relative timing of the LOAD and STORE instructions that the threads issue.
This scenario is an instance of asynchronous interpreters (described in Section 2.1.2). Thus, we should make no assumptions about the sequence in which the memory operations of the two threads execute. When analyzing the concurrent execution of two threads, both executing instructions 6 through 8 in Figure 5.5, we can assume they execute in some serial sequence (because the bus arbiter will order any memory operations that arrive at the bus at the same time). However, because the relative speeds of the threads are unpredictable, we can make no assumptions about the order in the sequence.
We represent the execution of instruction 6 by thread A as “A6”. Using this representation, we see that one possible sequence might be as follows: A6, A7, A8, B6, B7, B8. In this case, the program works as expected. Suppose we just started, so variables in and out are both zero. Thread A performs all of its three instructions before thread B performs any of its three instructions. With this order, thread A inserts an item in entry 0 and increments in from 0 to 1. Thread B adds an item in entry 1 and increments in from 1 to 2.
Another possible, but undesirable, sequence is A6, B6, B7, A7, A8, B8, which corresponds to the following timing diagram:

    Thread A:  A6            A7  A8
    Thread B:      B6  B7            B8
With this order, thread A, at A6, discovers that entry 0 of the buffer is free. Then, at B6, B also discovers that buffer entry 0 is free. At B7, B stores an item in entry 0 of buffer. Then, A proceeds: at A7 it also stores an item in entry 0, overwriting B’s item. Then, both increment in (A8 and B8), setting in first to 1 and then to 2. Thus, at the end of this order of instructions, one print job is lost (thread B’s job), and (because both threads incremented in) the receiver will find that entry 1 in the buffer was never filled in.
This type of error is called a race condition because it depends on the exact timing of two threads. Whether or not an error happens cannot be controlled. It is nasty, since some sequences deliver a correct result and some sequences deliver an incorrect result.
Worse, small timing changes between invocations might result in different behavior. If we notice that B’s print job was lost and we run it again to see what went wrong, we might get a correct result on the retry because the relative timing of the instructions has changed slightly. In particular, if we add instructions (e.g., for debugging) on the retry, it is almost guaranteed that the timing is changed (because the threads execute additional instructions) and we will observe a different behavior. Bugs that disappear when the debugger starts to close in on them are colloquially called “Heisenbugs” in a tongue-in-cheek pun on the Heisenberg uncertainty principle of quantum mechanics. Heisenbugs are difficult to reproduce, which makes debugging difficult.
Race conditions are the primary pitfall in writing concurrent programs and the main reason why developing concurrent programs should be left to specialists, despite the existence of tools that help identify races (e.g., see Savage et al. [Suggestions for Further Reading 5.5.6]). Concurrent programming is subtle. In fact, with several senders the program of Figure 5.5 has a second race condition. Consider statement 8, which the senders execute:
in ← in + 1
In reality, a thread executes this statement in three separate steps, which can be expressed as follows:
1 LOAD in, R0 // Load the value of in into a register
2 ADD R0, 1 // Increment
3 STORE R0 , in // Store result back to in
Consider two sending threads running simultaneously, threads A and B, respectively. The instructions of the threads might execute in the sequence A1, A2, A3, B1, B2, B3, which corresponds to the following timing diagram:

    Thread A:  A1  A2  A3
    Thread B:              B1  B2  B3
In this case in is incremented by two, as the programmer intended.
But now consider the execution sequence A1, B1, A2, A3, B2, B3, which corresponds to the following timing diagram:

    Thread A:  A1      A2  A3
    Thread B:      B1          B2  B3
When the two threads finish, this ordering of memory references has increased in by only 1. At A1, thread A loads the R0 register of its thread with the value of in, which is 0. At B1, thread B does exactly the same thing, loading its thread’s register R0 with the value 0. Then, at A2, thread A computes the new value in R0 and at A3 updates in with the value 1. Next, at B2 and B3, thread B does the same thing: it computes the new value in R0 and updates in with the value 1. Thus, in ends up containing 1 instead of the intended 2. Any time two threads update a shared variable concurrently (i.e., the one-writer principle is violated), a race condition is possible.
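This lost-update interleaving is easy to demonstrate on real hardware. The following self-contained C program (an illustration, not from the text) runs two threads that each increment a shared variable 1,000,000 times with no coordination; the final value is typically well below 2,000,000. The volatile qualifier merely forces a LOAD and STORE on every iteration so the race stays visible under compiler optimization; it does not make the update atomic.

```c
#include <pthread.h>
#include <stdio.h>

static volatile long in = 0;      /* shared variable with two writers: a race */

static void *sender(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        in = in + 1;              /* executes as LOAD, ADD, STORE             */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, sender, NULL);
    pthread_create(&b, NULL, sender, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("in = %ld (expected 2000000)\n", in);  /* usually less: lost updates */
    return 0;
}
```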
We caused this second race condition by allowing multiple senders. But the manipulation of the variables in and out also has a potential race even if there is only one sender and one receiver, and we remove assumption 5 (the before-or-after atomicity requirement). Let’s assume that we want to make in and out of type long integer so that there is little risk of overflowing those two variables. In that case, in and out each span two memory cells instead of one, and updates to in and out are no longer atomic operations. That change creates yet another race.
If in is a long integer, then updating in would require two instructions:
1 STORE R0, in + 1 // Update the least-significant word of in
2 STORE R1, in // Update the most-significant word of in
To read in would also require two instructions:
3 LOAD in + 1, R0 // Load the least-significant word of in into a register
4 LOAD in, R1 // Load the most-significant word of in into a register
If the sender executes instructions 1 and 2 at about the same time that the receiver executes instructions 3 and 4, a race condition could manifest itself. Let's assume that two threads call SEND and RECEIVE 2^32 − 1 times, and interleave their calls perfectly. At this point there are no messages in the buffer and in = out = 0x00000000FFFFFFFF, using big-endian notation.
Let's consider the scenario in which thread A has just added a message to the buffer, has read in into R0 and R1 (at instructions 3 and 4), has computed the new value for in in the registers R0 and R1, and has executed instruction 1 to update the least-significant word of in in memory. But before A executes instruction 2, thread B adds a message:

    Thread A:  A3  A4  ...  A1                   A2
    Thread B:                   B's entire SEND
In this case, the program works incorrectly because A has stored a message in entry 15 of the buffer (0x00000000FFFFFFFF modulo 20 = 15), B stores a message in entry 0, and A completes the update of in, which sets in to 0x0000000100000000. B's message in entry 0 will be lost because entry 0 will be overwritten by the next caller of SEND.
Race conditions are not uncommon in complex systems. Two notorious ones occurred in CTSS and in the Therac-25 machine. In CTSS, an early operating system, all running instances of a text editor used the same name for temporary files. At some point, two administrators were concurrently editing the file with the message of the day and a file containing passwords. The content of the two files ended up being exchanged (see Section 11.11.2 [on-line] for the details): when users logged into CTSS, it displayed the pass phrases of all other users as the message of the day.
The Therac-25 is a machine that delivers medical irradiation to human patients [Suggestions for Further Reading 1.9.5]. A race condition between a thread and the operator allowed an incorrect radiation intensity to be set: as a result, some patients died. The repairman could not reproduce the problem, since he typed more slowly than the more experienced operator of the machine.
Problem sets 4, 5, and 6 ask the reader to find race conditions in a few small, concurrent code fragments.
From the examples in the preceding section we can see that the program in Figure 5.5 was carefully written so that it didn’t violate assumptions 1 and 5. If we make slight modifications to the program or use the program in slightly different ways than it was intended to be used, we violate the assumptions and the program exhibits race conditions. We would like a technique by which a developer can systematically avoid race conditions. This section introduces a mechanism called a lock, with which a designer can make a multistep operation behave like a single-step operation. By using locks carefully, we can modify the program in Figure 5.5 so that it enforces assumptions 1 and 5, and thus avoids the race conditions systematically.
A lock is a shared variable that acts as a flag to coordinate usage of other shared variables. To work with locks we introduce two new primitives: ACQUIRE and RELEASE, both of which take the name of a lock as an argument. A thread may ACQUIRE a lock, hold it for a while, and then RELEASE it. While a thread is holding a lock, other threads that attempt to acquire that same lock will wait until the first thread releases the lock. By surrounding multistep operations involving shared variables with ACQUIRE and RELEASE, the designer can make the multistep operation on shared variables behave like a single-step operation and avoid undesirable interleavings of multistep operations.
Figure 5.6 shows the code of Figure 5.5 with the addition of ACQUIRE and RELEASE invocations. The modified program uses only one lock (buffer_lock) because there is a single data structure that must be protected. The lock guarantees that the program works correctly when there are several senders and receivers. It also guarantees correctness when in and out are long integers. That is, the two assumptions under which the program of Figure 5.5 is correct are now guaranteed by the program itself.
Figure 5.6 An implementation of SEND and RECEIVE that adds locks so that there can be multiple senders and receivers. The RELEASE and ACQUIRE on lines 9 and 10 are explained in Section 5.2.5.
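As with Figure 5.5, the figure's code is not reproduced here; the following C sketch reconstructs the locked version from the text's description, with POSIX mutexes standing in for the book's ACQUIRE and RELEASE (comments note the figure line numbers cited in the text and caption). The release/re-acquire inside the spin loop is what the caption's lines 9 and 10 refer to: it lets a receiver obtain the lock and drain the buffer while a sender waits.

```c
#include <pthread.h>

#define N 20

typedef struct { int payload; } message;

typedef struct {
    message buf[N];
    long long in, out;                 /* long integers: updates are multistep */
    pthread_mutex_t buffer_lock;       /* one lock protects the whole structure */
} bounded_buffer;

void SEND(bounded_buffer *p, message msg) {
    pthread_mutex_lock(&p->buffer_lock);           /* ACQUIRE (buffer_lock)    */
    while (p->in - p->out == N) {                  /* figure line 8: full?     */
        pthread_mutex_unlock(&p->buffer_lock);     /* figure line 9: RELEASE   */
        pthread_mutex_lock(&p->buffer_lock);       /* figure line 10: retry    */
    }
    p->buf[p->in % N] = msg;                       /* figure line 11           */
    p->in = p->in + 1;                             /* figure line 12           */
    pthread_mutex_unlock(&p->buffer_lock);         /* RELEASE (buffer_lock)    */
}

message RECEIVE(bounded_buffer *p) {
    pthread_mutex_lock(&p->buffer_lock);
    while (p->in == p->out) {                      /* empty: release and retry */
        pthread_mutex_unlock(&p->buffer_lock);
        pthread_mutex_lock(&p->buffer_lock);
    }
    message msg = p->buf[p->out % N];
    p->out = p->out + 1;
    pthread_mutex_unlock(&p->buffer_lock);
    return msg;
}
```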
The ACQUIRE and RELEASE invocations make the reads and writes of the shared variables p.in and p.out behave like a single-step operation. The lock set by ACQUIRE and RELEASE ensures that the test and the manipulation of the buffer are executed as one indivisible action; thus, no undesirable interleavings and races can happen. If two threads attempt to execute the multistep operation between ACQUIRE and RELEASE concurrently, one thread acquires the lock and finishes the complete multistep operation before the other thread starts on the operation. The ACQUIRE and RELEASE primitives have the effect of dynamically implementing the one-writer principle on those variables: they ensure there is only a single writer at any instant, but the identity of the writer can change.
It is important to keep in mind that when a thread acquires a lock, the shared variables that the lock is supposed to protect are not mechanically protected from access by other threads. Any thread can still read or write those variables without acquiring the lock. The lock variable merely acts as a flag, and for correct coordination all threads must honor an additional convention: they must not perform operations on shared variables unless they hold the lock. If any thread fails to honor that convention, there may be undesirable interleavings and races.
To ensure correctness in the presence of concurrent threads, a designer must identify all potential races and carefully insert invocations of ACQUIRE and RELEASE to prevent them. If the locking statements don't ensure that multistep operations on shared variables appear as single-step operations, then the program may have a race condition. For example, if in the SEND procedure of Figure 5.6 the programmer places the ACQUIRE and RELEASE statements around just the statements on lines 11 through 12, then several race conditions may happen. If the lock doesn't protect the test of whether there's space in the buffer (line 8), a buffer with only one space free could be appended to by multiple concurrent invocations of SEND. Also, before-or-after atomicity for in and out (assumption 5) could be violated during the comparisons of p.in with p.out, so the race described in Section 5.2.3 could still occur. Programming with locks requires great attention to detail. Chapter 9 [on-line] will explore schemes that allow the designer to systematically ensure correctness for multistep operations involving shared variables.
A lock can be used to implement before-or-after atomicity. During the time that a thread holds a lock that protects one or more shared variables, it can perform a multistep operation on these shared variables. Because other threads that honor the lock protocol will not concurrently read or write any of the shared variables, from their point of view the multiple steps of the first thread appear to happen indivisibly: before the lock is acquired, none of the steps have occurred; after the lock is released, all of them are complete. Any operation by a concurrent thread must happen either completely before or completely after the before-or-after atomic action.
The need for before-or-after atomicity has been realized in different contexts, and as a result that concept and before-or-after atomic actions are known by various names. The database literature uses the terms isolation and isolated actions; the operating system literature uses the terms mutual exclusion and critical sections; and the computer architecture literature uses the terms atomicity and atomic actions. Because Chapter 9 [on-line] introduces a second kind of atomicity, this text uses the qualified term “before-or-after atomicity” for precision as well as for its self-defining and mnemonic features.
A tremendous amount of work in the computer science community has gone into finding race conditions in programs and into avoiding them in the first place. This text introduces the fundamental ideas in concurrent programming, but the interested reader is encouraged to explore the literature to learn more.
The usual implementation of ACQUIRE and RELEASE guarantees that only a single thread can acquire a given lock at any one time. This requirement is called the single-acquire protocol. If the programmer knows more details about how the protected shared variables will be used, a more relaxed protocol may be able to allow more concurrency. For example, Section 9.5.4 describes a multiple-reader, single-writer locking protocol.
In larger programs with many shared data structures, a programmer often uses several locks. For example, if each of the several data structures is used by different operations, then we might introduce a separate lock for each shared data structure. That way, the operations that use different shared data structures can proceed concurrently. If the program used just one lock to protect all of the data structures, then all operations would be serialized by the lock. On the other hand, using several locks raises the complexity of understanding a program by another notch, as we will see next.
Problem sets 4 and 5 explore several possible locations for ACQUIRE and RELEASE statements in an attempt to remove a race condition while still allowing for concurrent execution of some operations. Birrell's tutorial [Suggestions for Further Reading 5.3.1] provides a nice introduction to writing concurrent programs with threads and locks.
A programmer must use locks with care, because it is easy to create other undesirable situations that are as bad as race conditions. For example, using locks, a programmer can create a deadlock, which is an undesirable interaction among a group of threads in which each thread is waiting for some other thread in the group to make progress.
Consider two threads, A and B, that both must acquire two locks, L1 and L2, before they can proceed with their task:
| Thread A | Thread B |
| ACQUIRE(L1) | ACQUIRE(L2) |
| ACQUIRE(L2) | ACQUIRE(L1) |
This code fragment has a race condition that results in deadlock, as shown in the following timing diagram:
Thread A cannot make forward progress because thread B has acquired L2, and thread B cannot make forward progress because thread A has acquired L1. The threads are in a deadly embrace.
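The deadly embrace is easy to reproduce. The following sketch, again assuming POSIX threads, mirrors the fragment above; whether a given run hangs depends on how the two threads happen to interleave, which is exactly the race.

#include <pthread.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

void *thread_a(void *arg) {
    pthread_mutex_lock(&L1);      /* A holds L1 ... */
    pthread_mutex_lock(&L2);      /* ... and may now wait forever for B's L2 */
    /* task would go here */
    pthread_mutex_unlock(&L2);
    pthread_mutex_unlock(&L1);
    return NULL;
}

void *thread_b(void *arg) {
    pthread_mutex_lock(&L2);      /* B holds L2 ... */
    pthread_mutex_lock(&L1);      /* ... and may now wait forever for A's L1 */
    /* task would go here */
    pthread_mutex_unlock(&L1);
    pthread_mutex_unlock(&L2);
    return NULL;
}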
If we had modified the code so that both threads acquire the locks in the same order (L1 and then L2, or vice versa), then no deadlock could have occurred. Again, small changes in the order of statements can result in good or bad behavior.
A convenient way to represent deadlocks is using a wait-for graph. The nodes in a wait-for graph are threads and resources such as locks. When a thread acquires a lock, it inserts a directed edge from the lock node to the thread node. When a thread must wait for a resource, it inserts another directed edge from the thread node to the resource node. As an example, the race condition with threads A, B, and locks L1 and L2 results in the following wait-for graph:
When thread A acquires lock L1, it inserts arrow 1. When thread B acquires lock L2, it inserts arrow 2. When thread A must wait for lock L2, it inserts arrow 3. When thread B attempts to acquire lock L1 but must wait, it inserts arrow 4. When a thread must wait, we check if the wait-for graph contains a cycle. A cycle indicates deadlock: everyone is waiting for someone else to release a resource. In general, if, and only if, a wait-for graph contains a cycle, then threads are deadlocked.
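A thread manager can apply this rule mechanically. The sketch below, a depth-first search in C over an adjacency matrix, is one way to test for a cycle at the moment a thread is about to wait; the fixed node count N and the matrix representation are simplifying assumptions.

#include <stdbool.h>

#define N 8                     /* nodes: threads and resources, numbered 0..N-1 */
static bool edge[N][N];         /* edge[i][j]: node i waits for (or is held by) node j */

static bool visit(int node, bool on_path[]) {
    if (on_path[node]) return true;        /* back to a node on this path: a cycle */
    on_path[node] = true;
    for (int next = 0; next < N; next++)
        if (edge[node][next] && visit(next, on_path))
            return true;
    on_path[node] = false;
    return false;
}

bool deadlocked(int waiting_thread) {      /* call when a thread must wait */
    bool on_path[N] = { false };
    return visit(waiting_thread, on_path);
}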
When there are several locks, a good programming strategy to avoid deadlock is to enumerate all lock usages and ensure that all threads of the program acquire the locks in the same order. This rule will ensure there can be no cycles in the wait-for graph and thus no deadlocks. In our example, if thread B above did ACQUIRE(L1) before ACQUIRE (L2), the same order that thread A used, then there wouldn’t have been a problem. In our example program, it is easy for the programmer to modify the program to ensure that locks are acquired in the same order because the ACQUIRE statements are shown next to each other and there are only two locks. In a real program, however, the four ACQUIRE statements may be buried deep inside two separate modules that threads A and B happen to call indirectly in different orders, and ensuring that all locks are acquired in a static global order requires careful thinking and design.
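One common way to approximate a static global order, sketched below under the assumption that all the locks live in one address space, is to order locks by their memory addresses (a widely used trick, though comparing unrelated pointers is not strictly portable C). Any thread that needs both locks then acquires them lower address first.

#include <pthread.h>

/* Acquire two locks in a single global order (here, by address), so no pair
   of threads can each hold one lock while waiting for the other. */
void acquire_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a > b) { pthread_mutex_t *tmp = a; a = b; b = tmp; }
    pthread_mutex_lock(a);      /* everyone takes the lower-addressed lock first */
    pthread_mutex_lock(b);
}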
A deadlock doesn’t always have to involve multiple locks. For example, if the sender forgets to release and acquire the lock on lines 9 and 10 of Figure 5.6, then a deadlock is also possible. If the buffer is full, the receiver will not get a chance to remove a message from the buffer because it cannot acquire the lock, which is being held by the sender. In this case, the sender is waiting on the receiver to change the value of p.out (in a wait-for graph, the resource is buffer space represented by the value of p.out), and the receiver is waiting on the sender to release the lock. Simple programming errors can lead to deadlocks.
A problem related to deadlock is livelock—an interaction among a group of threads in which each thread is repeatedly performing some operations but is never able to complete the whole sequence of operations. An example of livelock is given in Sidebar 5.2, which presents an algorithm to implement ACQUIRE and RELEASE.
A correct implementation of ACQUIRE and RELEASE must enforce the single-acquire protocol. Several threads may attempt to acquire the lock at the same time, but only one should succeed. This requirement makes the implementation of locks challenging. In essence, we must make sure that ACQUIRE itself is a before-or-after action.
To see what goes wrong if ACQUIRE is not a before-or-after action, consider the too-simple implementation of ACQUIRE as shown in Figure 5.7. This implementation is broken because it has a race condition. If two threads labeled A and B call FAULTY_ACQUIRE at the same time, the threads may execute the statements in the order A5, B5, A6, B6, which corresponds to the following timing diagram:
Figure 5.7 Incorrect implementation of ACQUIRE. LOCKED and UNLOCKED are constants that have different values; for example, LOCKED is 1 and UNLOCKED is 0.
The result of this sequence of events is that both threads acquire the lock, which violates the single-acquire protocol.
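In C, the broken idea looks like the sketch below; this is a rendering of the two steps the text labels lines 5 and 6, not Figure 5.7 verbatim. Between one thread's test and its store, another thread can run its own test, so both can observe UNLOCKED and both can proceed.

enum { UNLOCKED = 0, LOCKED = 1 };

/* BROKEN: the test and the store are separate steps, so two threads can
   interleave between them and both believe they acquired the lock. */
void faulty_acquire(volatile int *L) {
    while (*L == LOCKED)        /* steps A5 and B5: spin until the lock looks free */
        ;
    *L = LOCKED;                /* steps A6 and B6: grab it (too late to be safe) */
}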
The faulty ACQUIRE has a multistep operation on a shared variable (the lock), and we must ensure in some way that ACQUIRE itself is a before-or-after action. Once ACQUIRE is a before-or-after action, we can use it to turn arbitrary multistep operations on shared variables into before-or-after actions. This reduction is an example of a technique called bootstrapping, which resembles an inductive proof. Bootstrapping means that we look for a systematic way to reduce a general problem (e.g., making multistep operations on shared variables before-or-after actions) to some much-narrower particular version of the same problem (e.g., making an operation on a single shared lock a before-or-after action). We then solve the narrow problem using some specialized method that might work for only that case because it takes advantage of the specific situation. The general solution then consists of two parts: a method for solving the special case and a method for reducing the general problem to the special case. In the case of ACQUIRE, the solution for the specific problem is either building a special hardware instruction that is itself a before-or-after action or programming very carefully.
We first look at a solution involving a special instruction, Read and Set Memory (RSM). RSM performs the statements in the block do atomic as a before-or-after action:
1 procedure RSM (reference mem)   // RSM memory location mem
2   do atomic
3     r ← mem                     // Load value stored at mem into r
4     mem ← LOCKED                // Store LOCKED into memory location mem
5   return r
Most modern computers implement some version of the RSM procedure in hardware, as an extension to the memory abstraction. RSM is then often called test-and-set; see Sidebar 5.1. For the RSM instruction to be a before-or-after action, the bus arbiter that controls the bus connecting the processors to the memory must guarantee that the LOAD (line 3) and STORE (line 4) instruction execute as before-or-after actions—for example, by allowing the processor to read a value from a memory location and to write a new value into that same location in a single bus cycle. We have thus pushed the problem of providing a before-or-after action down to the bus arbiter, a piece of hardware whose precise function is turning bus operations into before-or-after actions: the arbiter guarantees that if two requests arrive at the same time, one of those requests is executed completely before the other begins.
Sidebar 5.1 RSM, Test-and-Set, and Avoiding Locks
RSM is often called "test-and-set" or "test-and-set-locked" for accidental reasons. An early version of the instruction tested the lock and performed the store only if the test showed that the lock was not set. The instruction also set a bit that the software could test to find out whether or not the lock had been set. Using this instruction, one can implement the body of ACQUIRE as follows:
while TEST_AND_SET(L) = LOCKED do nothing
This version appears to be shorter than the one shown in Figure 5.8, but the hardware performs a test that is redundant. Thus, later hardware designers removed the test from test-and-set, but the name stuck.
Figure 5.8 ACQUIRE and RELEASE using RSM.
In addition to RSM, there are many other instructions, including "test-and-test-and-set" (which allows for a more efficient implementation of a spin lock) and COMPARE_AND_SWAP (v1, m, v2) (which compares, in a before-or-after action, the content of a memory location m to the value v1 and, if they are the same, stores v2 in m). The "compare-and-swap" instruction can be used, for example, to implement a linked list in which threads can insert elements concurrently without having to use locks, avoiding the risk of spinning until other threads have completed their insert [see Suggestions for Further Reading 5.5.8 and 5.5.9]. Such implementations are called non-blocking.
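As a sketch of how compare-and-swap eliminates the lock, the C11 fragment below pushes an element onto a shared linked list used as a stack. If another thread changed the head between the load and the swap, the swap fails and the loop retries. The names and the stack discipline are illustrative assumptions, not the cited papers' algorithms.

#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };
static _Atomic(struct node *) head = NULL;

void push(struct node *n) {
    struct node *old = atomic_load(&head);
    do {
        n->next = old;          /* link to the head we last observed */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
    /* on failure, old is reloaded with the current head, so the retry re-links */
}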
The Linux kernel uses yet another form of coordination that avoids locks. It is called read-copy update and is tailored to data structures that are mostly read and infrequently updated [see Suggestions for Further Reading 5.5.7].
Using the RSM instruction, we can implement any other before-or-after action. It is the one essential before-or-after action from which we can bootstrap any other set of before-or-after actions. Using RSM, we can implement ACQUIRE and RELEASE as shown in Figure 5.8. This implementation follows the single-acquire protocol: if L is LOCKED, then one thread has the lock L; if L contains UNLOCKED, then no thread has acquired the lock L.
To see that the implementation is correct, let’s assume that L is UNLOCKED. If some thread calls ACQUIRE (L), then after RSM, L is LOCKED and R1 contains UNLOCKED, so that thread has acquired the lock. The next thread that calls ACQUIRE (L) sees LOCKED in R1 after the RSM instruction, signaling that some other thread holds the lock. The thread that tried to acquire will spin until R1 contains UNLOCKED. When releasing a lock, no test is needed, so an ordinary STORE instruction can do the job without creating a race condition.
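For comparison, here is a minimal sketch of the same spin lock, assuming C11 atomics: atomic_exchange plays the role of RSM, returning the old value of L and storing LOCKED as one before-or-after action.

#include <stdatomic.h>

enum { UNLOCKED = 0, LOCKED = 1 };

void acquire(atomic_int *L) {
    while (atomic_exchange(L, LOCKED) == LOCKED)
        ;                       /* old value was LOCKED: someone else holds L, so spin */
}

void release(atomic_int *L) {
    atomic_store(L, UNLOCKED);  /* no test needed: an ordinary store suffices */
}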
The implementation in Figure 5.8 assumes that the shared memory provides read/write coherence. For example, if a manager thread sets L to UNLOCKED on line 7, then we assume that a thread spinning on line 3 of ACQUIRE observes that store and falls out of the spinning loop. Some memories provide more relaxed semantics than read/write coherence; in that case, additional mechanisms are needed to make this program work correctly.
With this implementation, even a single thread can deadlock itself by calling ACQUIRE twice on the same lock. With the first call to ACQUIRE, the thread obtains the lock. With the second call to ACQUIRE the thread deadlocks, since some thread (itself) already holds the lock. By storing the thread identifier of the lock’s owner in L (instead of true or false), ACQUIRE could check for this problem and return an error.
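A sketch of that variation, assuming C11 atomics and that every thread has a small nonzero integer identifier: storing the owner's identifier in the lock lets ACQUIRE report the self-deadlock instead of spinning forever.

#include <stdatomic.h>

enum { E_OK = 0, E_SELF_DEADLOCK = -1 };     /* 0 in the lock means unlocked */

int acquire_checked(atomic_int *L, int my_id) {
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak(L, &expected, my_id))
            return E_OK;                     /* lock was free; we now own it */
        if (expected == my_id)
            return E_SELF_DEADLOCK;          /* we already hold this lock */
        /* some other thread owns the lock: spin and retry */
    }
}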
Problem set 6 explores concurrency issues using a SET-AND-GET remote procedure call, which executes as a before-or-after action.
The RSM instruction can also be implemented without extending the memory abstraction. In fact, one can implement RSM as a procedure in software using ordinary load and store instructions, but such implementations are complex. The key problem in implementing ACQUIRE without RSM is that several threads are attempting to modify the same shared variable (L in our example). For two threads to read L concurrently is fine (the bus arbiter ensures that LOADs are before-or-after actions, and both threads will read the same value), but reading and modifying L is a multistep operation that must be performed as a before-or-after action. If it is not, this multistep operation can lead to a race condition in which the outcome may be a violation of the single-acquire protocol. This observation suggests an approach to implementing RSM based on the one-writer principle: ensure that only one thread modifies L.
Sidebar 5.2 describes a software solution that follows that approach. This software solution is complex compared to the hardware implementation of RSM. To ensure that only one thread writes L, the software solution requires an array with one entry per thread. Such an array must be allocated for each lock. Moreover, the number of memory accesses to acquire a lock is linear in the number of threads. Also, if threads are created dynamically, the software solution requires a more complex data structure than an array. Between the need for efficiency and the requirement for an array of unpredictable size, designers generally implement RSM as a hardware instruction that invokes a special bus cycle.
Sidebar 5.2 Constructing a Before-or-After Action Without Special Instructions
In 1959, E. Dijkstra, a well-known Dutch programmer and researcher, posed to his colleagues the problem of providing a before-or-after action with ordinary read and write instructions as an amusing puzzle. Th. J. Dekker provided a solution for two threads, and Dijkstra generalized the idea into a solution for an arbitrary number of threads [Suggestions for Further Reading 5.5.2]. Subsequently, numerous researchers have looked for provable, efficient solutions. We present a simple implementation of RSM based on L. Lamport’s solution. Lamport’s solution, like other solutions, relies on the existence of a bus arbiter that guarantees that any single LOAD or STORE is a before-or-after action with respect to every other LOAD and STORE. Given this assumption, RSM can be implemented as follows:
shared boolean flag[N]                      // one boolean per thread

1 procedure RSM (lock reference L)          // set lock L and return old value
2   do forever                              // me is my index in flag
3     flag[me] ← TRUE                       // warn other threads
4     if ANYONE_ELSE_INTERESTED (me) then   // is another thread warning us?
5       flag[me] ← FALSE                    // yes, reset my warning, try again
6     else
7       R ← L.state                         // set R to value of lock
8       L.state ← LOCKED                    // and set the lock
9       flag[me] ← FALSE
10      return R
11
12 procedure ANYONE_ELSE_INTERESTED (me)    // is another thread updating L?
13   for i from 0 to N-1 do
14     if i ≠ me and flag[i] = TRUE then return TRUE
15   return FALSE
To guarantee that RSM is indeed a before-or-after action, we need to assume that each entry of the shared array is in its own memory cell, that the memory provides read/write coherence for memory cells, and that the instructions execute in program order, as we did for the sender and receiver in Figure 5.5.
Under these assumptions, RSM ensures that the shared variable L is never written by two threads at the same time. Each thread has a unique number, me. Before me is allowed to write L, it must express its interest in writing L by setting me’s entry in the boolean array flag (line 3) and check that no other thread is interested in writing L (line 4). If no other thread has expressed interest, then me acquires L (line 8).
If two threads A and B call RSM at the same time, either A or B may acquire L, or both may retry, depending on how the shared memory system orders the accesses of A and B to the flag[i] array. There are three cases:
1. A sets flag[A], calls ANYONE_ELSE_INTERESTED, and reads flags at least as far as flag[B] before B sets flag[B]. In this case, A sees no other flags set and proceeds to acquire L; B discovers A’s flag and tries again. On its next try, B encounters no flags, but by the time B writes LOCKED to L, L is already set to LOCKED, so B’s write will have no effect.
2. B sets flag[B], calls ANYONE_ELSE_INTERESTED, and reads flags at least as far as flag[A] before A sets flag[A]. In this case, B sees no other flags set and proceeds to acquire L; A discovers B's flag and tries again. On its next try, A encounters no flags, but by the time A writes LOCKED to L, L is already set to LOCKED, so A's write will have no effect.
3. A sets flag[A] and B sets flag[B] before either of them gets far enough through ANYONE_ELSE_INTERESTED to reach the other’s flag location. In this case, both A and B reset their own flag[i] entries and try again. On the retry, all three cases are again possible.
This implementation of RSM has a livelock problem because the two threads A and B might end up in the final case (neither of them gets to update L) every time they retry. RSM could reduce the chance of livelock by inserting a random delay before retrying, a technique called random backoff. Chapter 7 [on-line] will refine the random backoff idea to make it applicable to a wider range of problems.
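A sketch of random backoff in C, assuming a hypothetical try_once procedure that makes a single attempt (for example, one pass through the loop above) and reports whether it succeeded:

#include <stdlib.h>
#include <unistd.h>

extern int try_once(void);      /* hypothetical: one attempt, nonzero on success */

void acquire_with_backoff(void) {
    unsigned limit = 2;
    while (!try_once()) {
        usleep(rand() % limit); /* wait a random time so competing threads separate */
        if (limit < 1024)
            limit *= 2;         /* grow the range to cut the chance of colliding again */
    }
}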
This implementation of RSM is not the most efficient one; it is linear in the number of threads because ANYONE_ELSE_INTERESTED reads all but one element of the array flag. More efficient versions of RSM exist, but even the best implementation [Suggestions for Further Reading 5.5.3] requires two loads and five stores (if there is no contention for L), which can be proven to be optimal under the given assumptions.
If one follows the one-writer principle carefully, one can write programs without locks (for example, as in Figure 5.5). This approach without locks can improve a program’s performance because the expense of locks is avoided, but eliminating locks makes it more difficult to reason about the correctness.
The designers of the computer system for the space shuttle used many threads sharing many variables, and they deployed a systematic design approach to encourage a correct implementation. Designed in the late 1970s and early 1980s, the computers of the space shuttle were not efficient enough to follow the principled way of protecting all shared variables using locks. Understanding the risks of sharing variables among concurrent threads, however, the designers followed a rule that the program declaration for each unprotected shared variable must be accompanied by a comment, known as an alibi, explaining why no race conditions can occur even though that variable is unprotected. At each new release of the software, a team of engineers inspects all alibis and checks whether they still hold. Although this method has a high verification overhead, it helps discover many race conditions that otherwise might go undetected until too late. The use of alibis is an example of design for iteration.
As has been seen in this chapter, all implementations of before-or-after actions rely on bootstrapping from a properly functioning hardware arbiter. This reliance should catch the attention of hardware designers, who are aware that under certain conditions, it can be problematic (indeed, theoretically impossible) to implement a perfect arbiter. This section explains why and how hardware designers deal with this problem in practice. System designers need to be aware of how arbiters can fail, so that they know what questions to ask the designer of the hardware on which they rely.
The problem arises at the interface between asynchronous and synchronous components, when an arbiter that provides input to a synchronous subsystem is asked to choose between two asynchronous but closely spaced input signals. An asynchronous-input arbiter can enter a metastable state, with an output value somewhere between its two correct values or possibly oscillating between them at a high rate.* After applying asynchronous signals to an arbiter, one must therefore wait for the arbiter’s output to settle. Although the probability that the output of the arbiter has not settled falls exponentially fast, for any given delay time some chance always remains that the arbiter has not settled yet, and a sample of its output may find it still changing. By waiting longer, one can reduce the probability of it not having settled to as small a figure as necessary for any particular application, but it is impossible to drive it to zero within a fixed time. Thus if the component that acquires the output of the arbiter is synchronous, when its clock ticks there is a chance that the component’s input (that is, the arbiter’s output) is not ready. When that happens, the component may behave unpredictably, launching a chain of failure. Although the arbiter itself will certainly come to a decision at some point, not doing so before the clock ticks is known as arbiter failure.
Arbiter failure can be avoided in several ways:
Synchronize the clocks of the two components. If the two processors, the arbiter, and the memory all operate with a common clock (more precisely, all of their interfaces are synchronous), arbiter design becomes straightforward. This technique is used, for example, to arbitrate access within some chips that have several processors.
Design arbiters with multiple stages. Multiple stages do not eliminate the possibility of arbiter failure, but each additional stage multiplicatively reduces the probability of failure. The strategy is to provide enough stages that the probability of failure is so low that it can be neglected. With current technology, two or three stages are usually sufficient, and this technique is used in most interfaces between asynchronous and synchronous devices.
Stop the clock of the synchronous component (thus effectively making it asynchronous) and wait for the arbiter’s output to settle before restarting. In modern high-performance systems, clock distribution requires continuous ticks to provide signals for correcting phase errors, so one does not often encounter this technique in practice.
Make all components asynchronous. The component that takes the output of the arbiter then simply waits until the arbiter reports that it has settled. A flurry of interest in asynchronous circuit design arose in the 1970s, but synchronous circuits proved to be easier to design and so won out. However, as clock speeds increase to the point that it is difficult to distribute clock even across a single chip, interest is reawakening.
Communication across a network link is nearly always asynchronous, communication between devices in the same box (for example, between a disk drive and a processor) is usually asynchronous, and as mentioned in the last bullet above, as advancing technology reduces gate delays, it is becoming challenging to maintain a common, fast-enough clock all the way across even a single chip. Thus, within-chip intercommunication is becoming more network-like, with synchronous islands connected by asynchronous links (see, for example, Suggestions for Further Reading 1.6.3).
As pointed out, arbiter failure is an issue only at the boundary between synchronous and asynchronous components. Over the years, that boundary has moved with changing technology. The authors are not aware of any current implementations of RSM() or its equivalents that cross a synchronous/asynchronous boundary (in other words, current multiprocessor practice is to use the method of the first bullet above). Thus, before-or-after atomicity based on RSM() is not at risk of arbiter failure. But that was not true in the past, and it may not be true again at some point in the future. The system designer thus needs to be aware of where arbiters are being used and verify that they are specified appropriately for the application.
The implementation of bounded buffers took advantage of the fact that all threads share the same physical memory (see Figure 5.4 on page 209), but sharing memory does not enforce modularity well. A program may calculate a shared address incorrectly and write to a memory location that logically belongs to another module. To enforce modularity, we must ensure that the threads of one module cannot overwrite the data of another module. This section introduces domains and a memory manager to enforce memory boundaries, assuming that the address space is very large (i.e., so large that we can consider it unlimited). In Section 5.4, we will remove that assumption.
To contain the memory references of a thread, we restrict the thread’s references to a domain, a contiguous range of memory addresses. When a programmer calls ALLOCATE_THREAD, the programmer specifies a domain in which the thread is to run. The thread manager records a thread’s domain.
To enforce the rule that a thread should refer only to memory within its domain, we add a domain register to each processor and introduce a special interpreter, a memory manager, that is typically implemented in hardware and placed between a processor and the bus (see Figure 5.9). A processor’s domain register contains the lowest (low) and highest address (high) that the currently running thread is allowed to use. ALLOCATE_THREAD loads the processor’s domain register with the thread’s domain.
Figure 5.9 An editor thread running with its domain.
The memory manager checks for each memory reference that the address is equal to or higher than low and smaller than high. If it is, the memory manager issues the corresponding bus request. If not, it interrupts the processor, signaling a memory reference exception. The exception handler can then decide what to do. One option is to deliver an error message and destroy the thread.
This design ensures that a thread can make references only to addresses that are in its domain. Threads cannot overwrite or jump to memory locations of other threads.
This domain design achieves the main goal, but it lacks a number of desirable features:
1. A thread may need more than one domain. By using many domains, threads can control what memory they share and what memory they keep private. For example, a thread might allocate a domain for a bounded buffer and share that domain with another thread, but allocate private domains for the text of its program and private data structures.
2. A thread should be unable to change its own domain. That is, we must ensure that the thread cannot change the content of its processor’s domain register directly or indirectly. If a thread can change the content of its processor’s domain register, then the thread can make references to addresses that it shouldn’t.
The rest of Section 5.3 adds these features.
To allow for sharing, we extend the design to allow each thread to have several domains and give each processor several domain registers (for the moment, as many as a thread needs). Now a designer can partition the memory of the programs shown in Figure 5.9 and control sharing. For example, a designer may split a client into four separate domains (see Figure 5.10): one domain containing the program text for the client thread, one domain containing the data for the client, one domain containing the stack of the client thread, and one domain containing the bounded message buffer. The designer may split a service in the same way. This setup allows both threads to use the shared bounded buffer domain but restricts the threads' other references to their private domains.
Figure 5.10 A client and service, each with three private domains and one shared domain.
To manage this hardware design, we introduce a software component to the memory manager, which provides the following interface:
base_address ← ALLOCATE_DOMAIN (size): Allocate a new domain of size bytes and return the base address of the domain.
MAP_DOMAIN (base_address): Add the domain starting at address base_address to the calling thread’s domains.
The memory manager can implement this interface by keeping a list of memory regions that are not in use, allocating size bytes of memory on an ALLOCATE_DOMAIN request, and maintaining a domain table of allocated domains. An entry in the domain table records the base_address and size.
MAP_DOMAIN loads the domain’s bounds from the domain table into a domain register of the thread’s processor. If two or more threads map a domain, then that domain is shared among those threads.
We can improve the control mechanism by extending each domain register to record access permissions. In a typical design, a domain register might include three bits, which separately control permission to READ, WRITE, or EXECUTE (i.e., retrieve and use as an instruction) any of the bytes in the associated domain.
With these permissions, the designer can give the threads in Figure 5.10 EXECUTE and READ permissions to their text domains, and READ and WRITE permissions to their stack domains and the shared bounded buffer domain. This setup prevents a thread from taking instructions from its stack, which can help catch programming mistakes and also help avoid buffer overrun attacks (see Sidebar 11.4 [on-line]). Giving the program text only READ and EXECUTE permissions helps avoid the mistake of accidentally writing data into the text of the program.
The permissions also allow more controlled sharing: one thread can have access to a shared domain with only READ permission, whereas another thread can have READ and WRITE permissions.
To provide permissions, we modify the MAP_DOMAIN call as follows:
MAP_DOMAIN (base_address, permission): Load the domain's bounds from the domain table into one of the calling thread's domain registers with permission permission.
To check permissions, the memory manager must know which permissions are needed for each memory reference. A LOAD instruction requires READ permission for its address, and thus the memory manager must check that the address is in a domain with READ access. A STORE instruction requires WRITE permission for its address, and thus the memory manager must check that the address is in a domain with WRITE access. To execute an instruction at the address in the PC requires EXECUTE permission. The domain holding instructions may also require READ permission because the program may have stored constants in the program text.
The pseudocode in Figure 5.11 details the check performed by the memory manager. Although we describe the function of the memory manager using pseudocode, in practice the memory manager is a hardware device that implements its function in digital circuitry. In addition, the memory manager is typically integrated with the processor so that the address checks run at the speed of the processor. As Section 5.3 develops, we will add more functions to the memory manager, some of which may be implemented in software as part of the operating system. Later, in Section 5.4.4, we discuss the trade-offs involved in implementing parts of the memory manager in software.
Figure 5.11 The memory manager’s pseudocode for looking up an address and checking permissions.
As shown in the figure, on a memory reference the memory manager checks all the processor's domain registers. For each domain register, the memory manager calls CHECK_DOMAIN, which takes three arguments: the address the processor requested, a bit mask with the permissions needed by the current instruction, and the domain register. If address falls between low and high of the domain and the permissions needed are a subset of the permissions authorized for the domain, then CHECK_DOMAIN returns TRUE and the memory manager will issue the desired bus request. If address falls between low and high of the domain but the permissions needed aren't sufficient, then CHECK_DOMAIN interrupts the processor, indicating a memory reference exception as before. Now, however, it is useful to demultiplex the memory reference exception into two different categories: illegal memory reference exception and permission error exception. If address doesn't fall in any domain, the exception handler indicates an illegal memory reference exception. If the address falls in a domain but the thread doesn't have sufficient permissions, the exception handler indicates a permission error exception.
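The following C sketch restates that check; the field names and permission bits are assumptions consistent with the description above, not Figure 5.11 verbatim.

#include <stdbool.h>
#include <stdint.h>

enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXECUTE = 4 };

struct domain_register {
    uint64_t low, high;          /* addresses in [low, high) belong to the domain */
    unsigned permissions;        /* bitwise OR of the PERM_ bits */
};

/* TRUE if address lies inside this domain and the domain grants every
   permission that the current instruction needs. */
bool check_domain(uint64_t address, unsigned needed,
                  const struct domain_register *d) {
    return address >= d->low && address < d->high &&
           (needed & d->permissions) == needed;
}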
The demultiplexing of memory reference exceptions can be implemented either in hardware or software. If implemented in hardware, the memory manager signals an illegal memory reference exception or a permission error exception to the processor. If implemented in software, the memory manager signals a memory reference exception, and the exception handler for memory reference exceptions demultiplexes it by calling either the illegal memory reference exception handler or the permission error exception handler. As we will see in Chapter 6 (see Section 6.2.2), we will want to refine the categories of memory exceptions further. In processors in the field, some of the demultiplexing is implemented in hardware with further demultiplexing implemented in software in the exception handler.
In practice, only a subset of the possible combinations of permissions is useful. The ones used are: READ permission, READ and WRITE permissions, READ and EXECUTE permissions, and READ, WRITE, and EXECUTE permissions. The READ, WRITE, and EXECUTE combination of permissions might be used for a domain that contains a program that generates instructions and then jumps to the generated instructions, a so-called self-modifying program. Supporting self-modifying programs is risky, however, because this also allows an adversary to write a new procedure as data into a domain (e.g., using a buffer overrun attack) and then execute that procedure. In practice, self-modifying programs have proven to be more trouble than they are worth, except to system crackers. A domain with only EXECUTE permission or with just WRITE and EXECUTE permissions isn't useful in practice.
If memory-mapped I/O (described in Section 2.3.1) is used, then domain registers can also control which devices a thread can use. For example, a keyboard manager thread may have access to a domain that corresponds to the registers of the keyboard controller. If none of the other threads has access to this domain, then only the keyboard manager thread has access to the keyboard device. Thus, the same technique that controls separation of memory ranges can also control access to devices.
The memory manager can implement security policies because it controls which threads have access to which parts of memory. It can deny or grant a thread’s request for allocating a new domain or for sharing an existing domain. In the same way, it can control which threads have access to which devices. How to implement such security policies is the topic of Chapter 11 [on-line].
Domain registers restrict the addresses to which a thread can make reference, but we must also ensure that a thread cannot change its domains. That is, we need a mechanism to control changes to the low, high, and permission fields of a domain register. To complete the enforcement of domains, we modify the processor to prevent threads from overwriting the content of domain registers as follows:
Add one bit to the processor indicating whether the processor is in kernel mode or user mode, and modify the processor as follows. Instructions can change the value of the processor’s domain registers only in kernel mode, and instructions that attempt to change the domain register in user mode generate an illegal instruction exception. Similarly, instructions can change the mode bit only in kernel mode.
Extend the set of permissions for a domain to include KERNEL-ONLY, and modify the processor to make it illegal for threads in user mode to reference addresses in a domain with KERNEL-ONLY permission. A thread in user mode that attempts to read or write memory with KERNEL-ONLY permission causes a permission error exception.
Switch to kernel mode on an interrupt and on an exception so that the handler can process the interrupt (or exception) and invoke privileged instructions.
We can use these mechanisms as illustrated in Figure 5.12. Compared to Figure 5.10, each thread has two additional domains, which are marked K for KERNEL-ONLY. A thread must be in kernel mode to be able to make references to these domains, which contain the program text and data for the memory manager. These mechanisms ensure that a thread running in user mode cannot change its processor's domain registers; only when a thread executes in kernel mode can it change the processor domain registers. Furthermore, because the memory manager and its table are in kernel domains, a thread in user mode cannot change its domain information. We see that the kernel/user mode bit helps in enforcing modularity by restricting what threads in user mode can do.
Figure 5.12 Threads with a kernel domain containing the memory manager and its domain table.
Because threads running in user mode cannot invoke procedures of the memory manager directly, a thread must have a way of changing from user mode to kernel mode and entering the memory manager in a controlled manner. If a thread could enter at an arbitrary address, it might create problems; for example, if a thread could enter a domain with kernel permission at the instruction that sets the user-mode bit to KERNEL, it might be able to gain control of the processor with kernel privileges. To avoid this problem, a thread may enter a kernel domain only at certain addresses, called gates.
We implement gates by adding one more special instruction to the processor, the supervisor call instruction (SVC), which specifies in a register the name for the intended gate. Upon executing the SVC instruction, the processor performs two operations as one action:
1. Change the processor mode from user to kernel.
2. Set the PC to an address predefined by the hardware, the entry point of the gate manager.
The gate manager now has control in kernel mode; it can call the appropriate procedures to serve the thread’s request. Typically, gate names are numbers, and the gate manager has a table that records for each gate number the corresponding procedure. For example, the table might map gate 0 to ALLOCATE_DOMAIN, gate 1 to MAP_DOMAIN, gate 2 to ALLOCATE_THREAD, and so on.
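A sketch of such a gate table in C; the gate numbers and handler names follow the example above, and the single-argument signature is a simplifying assumption.

#include <stddef.h>

typedef long (*gate_proc)(long argument);

extern long allocate_domain(long size);          /* gate 0 */
extern long map_domain(long base_address);       /* gate 1 */
extern long allocate_thread(long domain);        /* gate 2 */

static const gate_proc gate_table[] = {
    allocate_domain, map_domain, allocate_thread,
};

/* Called by the gate manager with the gate number the SVC left in a register. */
long gate_dispatch(unsigned gate, long argument) {
    if (gate >= sizeof gate_table / sizeof gate_table[0])
        return -1;               /* no such gate: refuse the request */
    return gate_table[gate](argument);
}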
Implementing SVC has a slight complication: the steps to enter the kernel must happen as a before-or-after action: they must all be executed without interruption. If the processor could be interrupted in the middle of these steps, a thread might end up in kernel mode but with the program counter still pointing to an address in one of its user-level domains. Now the thread is executing instructions from an application module in kernel mode. To avoid this potential problem, processors complete all steps of an SVC instruction before executing another instruction.
When the thread wants to return to user mode, it executes the following instructions:
1. Change mode from kernel to user.
2. Load the program counter from the top of the stack into the processor’s PC.
Processors don’t have to perform these steps as a before-or-after action. After step 1, it is fine for a processor to return to kernel mode, for example, to process an outstanding interrupt.
The return sequence assumes that a thread has pushed the return address on its stack before invoking SVC. If the thread hasn’t done so, then the worst that can happen when the thread returns to user mode is that it resumes at some arbitrary address, which might cause the thread to fail (as with many other programming errors), but it cannot create a problem for a domain with KERNEL-ONLY permission because the thread cannot refer to that domain in user mode.
The difference between entering and leaving kernel mode is that on leaving, the value loaded in the program counter isn’t a predefined value. Instead, the kernel sets it to the saved address.
Gates can also be used to handle interrupts and exceptions. If the processor encounters an interrupt, the processor enters a special gate for interrupts, and the gate manager dispatches the interrupt to the appropriate handler, based on the source of the interrupt (clock interrupt, permission error, illegal memory reference, divide by zero, etc.). Some processors have a separate gate for exceptions (e.g., a permission error); others have a single gate for both interrupts (e.g., a clock interrupt) and exceptions.
Problem set 9 explores in a minimal operating system the interactions between hardware and software for setting modes and handling interrupts.
The implementation of SEND and RECEIVE in Figure 5.6 assumes that the sending and receiving threads share the bounded buffer, using, for example, a shared domain, as shown in Figure 5.12. This setup enforces a boundary between all domains of the threads, except for the domain containing the shared buffer. A thread can modify the shared buffer accidentally because both threads have write permissions to the shared domain. Thus, an error in one thread could indirectly affect the other thread; we would like to avoid that and enforce modularity for the bounded buffer.
We can protect the shared bounded buffer, too, by putting the buffer in a shared kernel domain (see Figure 5.13). Now the threads cannot directly write the shared buffer in user mode. The threads must transition to kernel mode to copy messages into the shared buffer. In this design, SEND and RECEIVE are supervisor calls. When a thread invokes SEND, it changes to kernel mode and copies the message from the sending thread’s domain into the shared buffer. When the receiving thread invokes RECEIVE, it changes to kernel mode and copies a message from the shared buffer into the receiving thread’s domain. As long as the program that is running in kernel mode is written carefully, this design provides stronger enforced modularity because threads in user mode have no direct access to the bounded buffer’s messages.
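The kernel side of this design might look like the following C sketch. It assumes a single sender and a single receiver (so the in/out counters suffice, as in the bounded buffer of Figure 5.6), fixed-size messages, and illustrative names (sys_send, sys_receive, kbuf); a real kernel would also verify that the user’s buffer lies within the caller’s domains and would make the updates before-or-after actions.

#include <stdio.h>
#include <string.h>

#define N 8           /* capacity of the bounded buffer, in messages */
#define MSGSIZE 64    /* fixed message size, in bytes */

/* The buffer lives in a kernel domain, so user-mode threads cannot
   address it directly; they reach it only through the two gates below. */
static struct {
    char message[N][MSGSIZE];
    unsigned long in;    /* count of messages sent so far */
    unsigned long out;   /* count of messages received so far */
} kbuf;

/* Reached through the SEND gate: copy from the sender's domain into the kernel. */
int sys_send(const char *user_message) {
    if (kbuf.in - kbuf.out == N) return -1;              /* buffer is full */
    memcpy(kbuf.message[kbuf.in % N], user_message, MSGSIZE);
    kbuf.in += 1;
    return 0;
}

/* Reached through the RECEIVE gate: copy from the kernel into the receiver's domain. */
int sys_receive(char *user_message) {
    if (kbuf.in == kbuf.out) return -1;                  /* buffer is empty */
    memcpy(user_message, kbuf.message[kbuf.out % N], MSGSIZE);
    kbuf.out += 1;
    return 0;
}

int main(void) {
    char msg_out[MSGSIZE] = "hello";
    char msg_in[MSGSIZE];
    sys_send(msg_out);
    sys_receive(msg_in);
    printf("%s\n", msg_in);
    return 0;
}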
Figure 5.13 Threads with a kernel domain containing the shared buffer, the memory manager, and the domain table.
This stronger enforced modularity comes at a performance cost for performing supervisor calls for SEND and RECEIVE. This cost can be significant because transitions between user mode and kernel mode can be expensive. The reason is that a processor typically maintains state in its pipeline and caches as a speedup mechanism. This state may have to be flushed or invalidated on a user-kernel mode transition because otherwise the processor might incorrectly execute instructions that are still in the pipeline.
Researchers have come up with techniques to reduce the performance cost by optimizing the kernel code paths for the SEND and RECEIVE supervisor calls, by having a combined call that sends and receives, by cleverly setting up domains to avoid the cost of copying large messages, by passing small arguments through processor registers, by choosing a suitable layout of data structures that reduces the cost of user-kernel mode transitions, and so on [Suggestions for Further Reading 6.2.1, 6.2.2, and 6.2.3]. Problem set 7 illustrates a lightweight remote procedure call implementation.
The collection of modules running in kernel mode is usually called the kernel program, or kernel for short. A question that arises is how the kernel and the first domain come into existence. Sidebar 5.3 details how a processor starts in kernel mode with domain checking disabled, how the kernel can then bootstrap the first domain, and how the kernel can create user-level domains.
The kernel is a trusted intermediary because it is the only program that can execute privileged instructions (such as storing to a processor’s domain registers) and the application modules rely on the kernel to operate correctly. Because the kernel must be a trusted intermediary for the memory manager hardware, many designers also make the kernel the trusted intermediary for all other shared devices, such as the clock, display, and disk. Modules that manage these devices must then be part of the kernel program. In this design, the window manager module, network manager module, and file manager module run in kernel mode. This kernel design, in which most of the operating system runs in kernel mode, is called a monolithic kernel (see Figure 5.14).
Figure 5.14 Monolithic organization: the kernel implements the operating system.
Sidebar 5.3 Bootstrapping an Operating System
When the user switches on the power for the computer, the processor starts with all registers set to zero; thus, the user-mode bit is off. The first instruction the processor executes is the instruction at address 0 (the value of the pc register). Thus after a reset, the processor fetches its first instruction from address 0.
Address 0 typically corresponds to a read-only memory (ROM). This memory contains some initial code, the boot code, a rudimentary kernel program, which loads the full kernel program from a magnetic disk. The computer manufacturer burns into the read-only memory the boot program, after which the boot program cannot be changed. The boot program includes a rudimentary file system, which finds the kernel program (probably written by a software manufacturer) at a pre-agreed location on disk. The boot code reads the kernel into physical memory and jumps to the first instruction of the kernel.
Bootstrapping the kernel through a small boot program provides modularity. The hardware and software manufacturers can develop their products independently, and users can change kernels, for example, to upgrade to a newer version or to use a different kernel vendor, without having to modify their hardware.
Sometimes there are multiple layers of booting to handle additional constraints. For example, the first boot loader may be able to load only a single block, which can be too small to hold the rudimentary kernel program. In such cases, the boot code may load first an even simpler kernel program, which then loads the rudimentary kernel program, which then loads the kernel program.
Once it is running, the kernel allocates a thread for itself. This thread allocation involves allocating a domain for use as a stack so that the thread can make procedure calls, allowing the rest of the kernel to be written in a high-level language. It may also allocate a few other domains, for example, one for the domain table.
Once the kernel has initialized, it typically creates one or more threads to run non-kernel services. It allocates to each service one or more domains (e.g., one for program text, one for a stack, and one for data). The kernel typically preloads some of the domains with the program text and data of the non-kernel services. A common solution to locating the program text and data is to assume that the first non-kernel program, like the kernel program, is at a predefined address on the magnetic disk or part of the data of the kernel program.
Once a thread is running in user mode, it can reenter the kernel using a gate for a kernel procedure. Using the kernel procedures, the user-level thread can create more threads, allocate domains for these threads, and, when done, exit.
Errors by threads in user mode (e.g., dividing by zero or using an address that is not in the thread’s domains or violates permissions) cause an exception, which changes the processor to kernel mode. The exception handler can then clean up the thread.
In kernel mode, errors such as dividing by zero are fatal and halt the computer because these errors are typically caused by programming mistakes in the kernel program, from which there is no easy way to recover. Since kernel errors are fatal, we must program and structure the kernel carefully.
We would like to keep the kernel small because the number of bugs in a program is at least proportional to the size of the program; some even argue it grows with the square of the program’s size. In a monolithic kernel, if the programmer of the file manager module has made an error, the file manager module may overwrite kernel data structures unrelated to the file system, thus causing unrelated parts of the kernel to fail.
The microkernel architecture structures the operating system itself in a client/service style (see Figure 5.15). By applying the idea of enforced modularity to the operating system itself, we can avoid some of the major problems of a monolithic organization. In the microkernel architecture, system modules run in user mode in their own domain, as opposed to being part of a monolithic kernel. The microkernel itself implements a minimal set of abstractions, primarily domains to contain modules, threads to run programs, and virtual communication links to allow modules to send messages to one another. The kernel described in this chapter with its interface shown in Table 5.1 is an example of a microkernel.
Figure 5.15 Microkernel organization: the operating system organized using the client/service model.
In the microkernel organization, for example, the window service module runs in its own domain with access to the display, the file service module runs in its own domain with access to a disk extent, and the database service runs in its own domain with its own disk extent. Clients of the services communicate with them by invoking remote procedure calls, whose stubs in turn invoke the SEND and RECEIVE supervisor calls. An early, clean design for a microkernel is presented by Hansen [Suggestions for Further Reading 5.1.1].
A benefit of the microkernel organization is that errors are contained within a module, simplifying debugging. A programming error in the file service module affects only the file service module; no other module ever has its internal data structures unintentionally modified because of an error by the programmer of the file service module. If the file service fails, a programmer of the file service can focus on debugging the file service and rule out the other services immediately. In contrast with the monolithic kernel approach, it is difficult to attribute an error in the kernel to a particular module because the modules aren’t isolated from each other and an error in one module may be caused by a flaw in another module.
In addition, if the file service fails, the database service may be able to continue operating. Of course, if the file service module fails, its clients cannot operate, but they may be able to invoke a recovery procedure that repairs the damage and restarts the file service. In the monolithic kernel approach if the file service fails, the kernel usually fails too, and the entire operating system must reboot.
Few widely used operating systems implement the microkernel approach in its purest form. In fact, most widely used operating systems today have a mostly monolithic kernel. Many critical services run inside the kernel, and only a few run outside the kernel. For example, in the GNU/Linux operating system the file and the network service run in kernel mode, but the X Window System runs in user mode.
Monolithic operating systems dominate the field for several reasons. First, if a service (e.g., a file service) is critical to the functioning of the operating system, it doesn’t matter much if it fails in user mode or in kernel mode; in either case, the system is unusable.
Second, some services are shared among many modules, and it can be easier to implement these services as part of the kernel program, which is already shared among all modules. For example, a cache of recently accessed file data is more effective when shared among all modules. Furthermore, this cache may need to coordinate its memory use with the memory manager, which is typically part of the kernel.
Third, the performance of some services is critical, and the overhead of SEND and RECEIVE supervisor calls may be too large to split subsystems into smaller modules and separate each module.
Fourth, a monolithic system can enjoy much of the debugging convenience of a microkernel system if the monolithic kernel comes with good kernel debugging tools.
Fifth, it may be difficult to reorganize existing kernel programs. In particular, there is little incentive to change a kernel program that already works. If the system works and most of the errors have been eradicated, the debugging advantage of microkernels begins to evaporate, and the cost of SEND and RECEIVE supervisor calls begins to dominate.
In general, if one has the choice between a working system and a better-designed but new system, one doesn’t want to switch over to the new system unless it is much better. One reason is the overhead of switching: learning the new design, reengineering the old system to use the new design, rediscovering undocumented assumptions, and discovering unrealized assumptions (large systems often work for reasons that weren’t fully understood). Another reason is the uncertainty of the gain of switching. Until there is evidence from the field, the claims about the better design are speculative. In the case of operating systems, there is little experimental evidence that microkernel-based systems are more robust than existing monolithic kernels. A final reason is opportunity cost: one can spend time reengineering existing software, or one can spend that time developing the existing software to address new needs. For these reasons, few systems have switched to a pure microkernel design. Instead, many existing systems have stayed with monolithic kernels, perhaps running a few services that are not performance critical as user-mode programs. Microkernel designs exist in more specialized areas, and research on microkernels continues to be active.
To address one problem at a time, the previous section assumed that memory and its address space are very large, large enough to hold all domains. In practice, memory and address space are limited. Thus, when a programmer invokes ALLOCATE_DOMAIN, we would like the programmer to specify a reasonable size. To allow a program to grow its domain if the specified size turns out to be too small, we could offer the programmer an additional primitive, GROW_DOMAIN.
Growing domains, however, creates memory management problems. For example, assume that program A allocates domain 1 and program B allocates domain 2, right after domain 1. Even if there is free memory after domain 2, program A cannot grow domain 1 because it would cross into domain 2. In this case, the only option left for program A is to allocate a new domain of the desired size, copy the contents of domain 1 into the new domain, change the addresses in the program that refer to addresses in domain 1 to instead refer to corresponding addresses in the new domain, and deallocate domain 1.
This memory management complicates writing programs and can make programs slow because of the memory copies. To reduce the programming burden of managing memory, most modern computer systems virtualize memory, a step that provides two features:
1. Virtual addresses. If programs address memory using virtual addresses and the memory manager translates the virtual addresses to physical addresses on the fly, then the memory manager can grow and move domains in memory behind the program’s back.
2. Virtual address spaces. A single address space may not be large enough to hold all addresses of all applications at the same time. For example, a single large database program by itself may need all the address space available in the hardware. If we can create virtual address spaces, then we can give each program its own address space. This extension also allows a thread to have its program loaded at an address of its choosing (e.g., address 0).
A memory manager that virtualizes memory is called a virtual memory manager. The design we work out in this section replaces the domain manager but incorporates the main features of domains: controlled sharing and permissions. We describe the virtual memory design in two steps. For the first step, Sections 5.4.1 and 5.4.2 introduce virtual addresses and describe an efficient way to translate them. For the second step, Section 5.4.3 introduces virtual address spaces. Section 5.4.4 discusses the trade-offs of software and hardware aspects of implementing a virtual memory manager. Finally, the section concludes with an advanced virtual memory design.
The virtual memory manager will deal with two types of addresses, so it is convenient to give them names. The threads issue virtual addresses when reading and writing to memory (see Figure 5.16). The memory manager translates each virtual address issued by the processor into a physical address, a bus address of a location in memory or a register on a controller of a device.
Figure 5.16 A virtual memory manager translating virtual addresses to physical addresses.
Translating addresses as they are being used provides design flexibility. One can design computers whose physical addresses have a different width than their virtual addresses. The memory manager can translate several virtual addresses to the same physical address, but perhaps with different permissions. The memory manager can allocate virtual addresses to a thread but postpone allocating physical memory until the thread makes a reference to one of the virtual addresses.
Virtualizing addresses exploits the design principle decouple modules with indirection. The virtual memory manager provides a layer of indirection between the processor and the memory system by translating the virtual addresses of program instructions into physical addresses, instead of having the program directly issue physical memory addresses. Because it controls the translation from the addresses issued by the program to the addresses understood by the memory system, the virtual memory manager can translate any particular virtual address to different physical memory addresses at different times. Thanks to the translation, the virtual memory manager can rearrange the data in the memory system without having to modify any application program.
From a naming point of view, the virtual memory manager creates a name space of virtual addresses on top of a name space of physical addresses. The virtual memory manager’s naming scheme translates virtual addresses into physical addresses.
Virtual memory has many uses. In this chapter, we focus on managing physical memory transparently. Later, in Section 6.2 of Chapter 6, we describe how virtual memory can also be used to transparently simulate a larger memory than the computer actually possesses.
To see how address translation can help memory management, consider a virtual memory manager with a virtual address space that is very large (e.g., 2^64 bytes) but with a physical address space that is smaller. Let’s assume that a thread has allocated two domains of size 100 bytes (see Figure 5.17a). The memory manager allocated the domains in physical memory contiguously, but in the virtual address space the domains are far away from each other. (ALLOCATE_DOMAIN returns a virtual address.) When a thread makes a reference to a virtual address, the virtual memory manager translates the address to the appropriate physical address.
Figure 5.17 (a) A thread has allocated domains 1 and 2; they are far apart in virtual memory but next to each other in physical memory. (b) In response to the thread’s request to grow domain 1, the virtual memory manager transparently moved domain 1 in physical memory and adjusted the translation from virtual to physical addresses.
Now consider the thread requesting to grow domain 1 from size 8 kilobytes to, say, 16 kilobytes. Without virtual addresses, the memory manager would deny this request because the domain cannot grow in physical memory without running into domain 2. With virtual addresses (see Figure 5.17b), however, the memory manager can grow the domain in the virtual address space, allocate the requested amount of physical memory, copy the content of domain 1 into the newly allocated physical memory, and update its mapping for domain 1. With virtual addresses, the application doesn’t have to be aware that the memory manager moved the contents of its domain in order to grow it.
Even ignoring the cost of copying the content of a domain, introducing virtual addresses comes at a cost in complexity and performance. The memory manager must manage virtual addresses in addition to a physical address space. It must allocate and deallocate them (if the virtual address space isn’t large), it must set up translations between virtual and physical addresses, and so on. The translation happens on-the-fly, which may slow down memory references. The rest of this section investigates these issues and presents a plan that doesn’t even require copying the complete content of a domain when growing the domain.
A naïve way of translating virtual addresses into physical addresses is to maintain a map that for each virtual address records its corresponding physical address. Of course, the amount of memory required to maintain this map would be large. If each physical address is a word (8 bytes) and the address space has 2^64 virtual addresses, then we might need 2^72 bytes of physical memory just to store the mapping.
A more efficient way of translation is using a page map. The page map is an array of page map entries. Each entry translates a fixed-sized range of contiguous bytes of virtual addresses, called a page, to a range of physical addresses, called a block, which holds the page. For now, the memory manager maintains a single page map, so that all threads share the single virtual address space, as before.
With this organization, we can think of the memory that threads see as a set of contiguous pages. A virtual address then is a name overloaded with structure consisting of two parts: a page number and a byte offset within that page (see Figure 5.18). The page number uniquely identifies an entry in the page map, and thus a page, and the byte offset identifies a byte within that page. (If the processor provides word addressing instead of byte addressing, the offset would specify the word within a page.) The size of a page, in bytes, is equal to the maximum number of different values that can be stored in the byte offset field of the virtual address. If the offset field is 12 bits wide, then a page contains 4,096 (2^12) bytes.
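In C, this split can be expressed with a shift and a mask. The sketch below assumes the 12-bit offset (4,096-byte pages) used in the text; the function names are illustrative.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12                      /* 4,096-byte (2^12) pages */
#define PAGE_SIZE (1ULL << OFFSET_BITS)

static uint64_t page_number(uint64_t va) { return va >> OFFSET_BITS; }     /* leftmost bits */
static uint64_t byte_offset(uint64_t va) { return va & (PAGE_SIZE - 1); }  /* rightmost 12 bits */

int main(void) {
    uint64_t va = 0x3012ABCULL;   /* an arbitrary virtual address */
    printf("page %llu, offset %llu\n",
           (unsigned long long)page_number(va),
           (unsigned long long)byte_offset(va));
    return 0;
}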
Figure 5.18 A virtual memory manager that translates virtual addresses by translating page numbers to block numbers.
With this arrangement, the virtual memory manager records in the page map, for each page, the block of physical memory that contains that page. We can think of a block as the container of a page. Physical memory is then a contiguous set of blocks, holding pages, but the pages don’t have to be contiguous in physical memory; that is, block 0 may hold page 100, block 1 may hold page 2, and so forth. The mapping between pages and the blocks that hold them can be arbitrary.
The page map simplifies memory management because the memory manager can allocate a block anywhere in physical memory and insert the appropriate mapping into the page map, without having to copy domains in physical memory to coalesce free space.
A physical address can also be viewed as having two parts: a block number that uniquely identifies a block of memory and an offset that identifies a byte within that block. Translating a virtual address to a physical address is now a two-step process:
1. The virtual memory manager translates the page number of the virtual address to a block number that holds that page by means of some mapping from page numbers to block numbers.
2. The virtual memory manager computes the physical address by concatenating the block number with the byte offset from the original virtual address.
Several different representations are possible for the page map, each with its own set of trade-offs for translating addresses. The simplest implementation of a page map is an array implementation, often called a page table. It is suitable when most pages have an associated block. Figure 5.19 demonstrates the use of a page map implemented as a linear page table. The virtual memory manager resolves virtual addresses into physical addresses by taking the page number from the virtual address and using it as an index into the page table to find the corresponding block number. Then, the manager computes the physical address by concatenating the byte offset with the block number found in the page-table entry. Finally, it sends this physical address to the physical memory.
Figure 5.19 An implementation of a virtual memory manager using a page table.
If the page size is 2^12 bytes and virtual addresses are 64 bits wide, then a linear page table is large (2^52 × 52 bits). Therefore, in practice, designers use a more efficient representation of a page map, such as a two-level one or an inverted one (i.e., indexed by physical address instead of virtual), but these designs are beyond the scope of this text.
To be able to perform the translation, the virtual memory manager must have a way of finding and storing the page table. In the usual implementation, the page table is stored in the same physical memory that holds the pages, and the physical address of the base of the page map is stored in a reserved processor register, typically named the page-map address register. To ensure that user-level threads cannot change translation directly and bypass enforced modularity, processor designs allow threads to write the page-map address register only in kernel mode and allow only the kernel to modify the page table directly.
Figure 5.20 shows an example of how a kernel could use the page map. The kernel has allocated a page map in physical memory at address 0. The page map provides modules with a contiguous universal address space, without forcing the kernel to allocate blocks of memory for a domain contiguously. In this example, block 100 contains page 12 and block 500 contains page 13. When a thread asks for a new domain or to grow an existing domain, the kernel can allocate any unused block and insert it in the page map. The level of indirection provided by the page map allows the kernel to do this transparently—the running threads are unaware.
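A minimal sketch of this bookkeeping follows, assuming a toy linear page map and illustrative names (allocate_block, grow_domain_by_one_page). The point is only that the kernel can pick any free block and record the page-to-block mapping; no copying is needed and the running threads are unaware.

#include <stdint.h>

#define N_PAGES 1024
#define N_BLOCKS 512
#define UNMAPPED UINT32_MAX    /* marks a page with no block behind it */

static uint32_t page_map[N_PAGES];   /* page number -> block number */
static int block_in_use[N_BLOCKS];

static void init_page_map(void) {
    for (int p = 0; p < N_PAGES; p++) page_map[p] = UNMAPPED;
}

/* Any free block will do; the page map hides the choice from the threads. */
static int allocate_block(void) {
    for (int b = 0; b < N_BLOCKS; b++)
        if (!block_in_use[b]) { block_in_use[b] = 1; return b; }
    return -1;    /* no free physical memory */
}

/* Grow a domain by one page: allocate a block and insert the mapping. */
int grow_domain_by_one_page(uint32_t next_page) {
    if (next_page >= N_PAGES || page_map[next_page] != UNMAPPED) return -1;
    int b = allocate_block();
    if (b < 0) return -1;
    page_map[next_page] = (uint32_t)b;
    return 0;
}

int main(void) {
    init_page_map();
    return grow_domain_by_one_page(12);   /* map page 12 to some free block */
}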
Figure 5.20 A virtual memory manager using a page table. The page table is located at physical address 0. It maps pages (e.g., 12) to blocks (e.g., 100).
The design so far has assumed that all threads share a single virtual address space that is large enough that it can hold all active modules and their data. Many processors have a virtual address space that is too small to do that. For example, many processors use virtual addresses that are 32 bits wide and thus have only 2^32 addresses, which represent 4 gigabytes of address space. This might be barely large enough to hold the most frequently used part of a large database, leaving little room for other modules. We can eliminate this assumption by virtualizing the physical address space.
A virtual address space provides each application with the illusion that it has a complete address space to itself. The virtual memory manager can implement a virtual address space by giving each virtual address space its own page map. A memory manager supporting multiple virtual address spaces may have the following interface:
id ← CREATE_ADDRESS_SPACE (): create a new address space. This address space is empty, meaning that none of its virtual pages are mapped to real memory. CREATE_ADDRESS_SPACE returns an identifier for that address space.
block ← ALLOCATE_BLOCK (): ask the memory manager for a block of memory. The manager attempts to allocate a block that is not in use. If there are no free blocks, the request fails. ALLOCATE_BLOCK returns the physical address of the block.
MAP (id, block, page_number, permission): put a block into id’s address space. MAP maps the physical address block to virtual page page_number with permissions permission. The memory manager allocates an entry in the page map for address space id, mapping the virtual page page_number to block block, and setting the page’s permissions to permission.
UNMAP (id, page_number): remove the entry for page_number from the page map so that threads have no access to that page and its associated block. An instruction that refers to a page that has been deleted is an illegal instruction.
FREE_BLOCK (block): add the block block to the list of free memory blocks.
DELETE_ADDRESS_SPACE (id): destroy an address space. The memory manager frees the page map and its blocks of address space id.
Using this interface, a thread may allocate its own address space or share its address space with other threads. When a programmer calls ALLOCATE_THREAD, the programmer specifies the address space in which the thread is to run. In many operating systems, the word “process” is used for the combination of a single virtual address space and the one or more threads that share it, but not consistently (see Sidebar 5.4).
Sidebar 5.4 Process, Thread, and Address Space
The operating systems community uses the word “process” often, but over the years it has come up with enough variants on the concept that when you read or hear the word you need to guess its meaning from its context. In the UNIX system (see Section 2.5 ), a process may mean one thread in a private address space (as in the early version of that system), or a group of threads in a private address space (as in later versions), or a thread (or group of threads) in an address space that is partly or completely shared (as in later versions of UNIX that also allow processes to share memory). That range of meanings is so broad as to render the term less than useful, which is why this text uses process only in the context of the early version of the UNIX system and otherwise uses only the terms thread and address space, which are the two core concepts.
The virtual address space is a thread’s domain, and the page map defines how it resides in physical memory. Thus the kernel doesn’t have to maintain a separate domain table with domain registers. If a physical block doesn’t appear in an address space’s page map, then the thread cannot make a reference to that physical block. If a physical block appears in an address space’s page map, then the thread can make a reference to that physical block. If a physical block appears in two page maps, then threads in both address spaces can make references to that physical block, which allows sharing of memory.
The memory manager can support domain permissions by placing the permission bits in the page-map entries. For example, one address space may have a block mapped with READ and WRITE permissions, while another address space has only READ permissions for that block. This design allows us to remove the domain registers, while keeping the concept of domains.
Figure 5.21 illustrates the use of several address spaces. It depicts two threads, each with its own address space but sharing block 800. Threads A and B have block 800 mapped at page 12. (In principle, the threads could map block 800 at different virtual addresses, but that complicates naming of the shared data in the block.) Thread A maps block 800 with READ permission, while thread B maps block 800 with READ and WRITE permissions. In addition to the shared block, each thread has two private blocks: one mapped with READ and EXECUTE permissions (e.g., for its program text) and one mapped with READ and WRITE permissions (e.g., for its stack).
Figure 5.21 Each thread has its own page-map address register. Thread A runs with the page map stored at address 300, while B runs with the page map stored at address 500. The page map contains the translation from page (p) to block (b), and the permissions required (P).
To support virtual address spaces, the page-map address register of a processor holds the physical address of the page map of the thread running on that processor, and translation then works as follows:
1 procedure TRANSLATE (integer virtual, perm_needed) returns physical_address
2     page ← virtual[0:51]                       // Extract page number
3     offset ← virtual[52:63]                    // Extract offset
4     page_table ← PMAR                          // Use the current page table
5     perm_page ← page_table[page].permissions   // Look up permissions for page
6     if PERMITTED (perm_needed, perm_page) then
7         block ← page_table[page].address       // Index into page map
8         physical ← block + offset              // Concatenate block and offset
9         return physical                        // Return physical address
10    else return error
Although usually implemented in hardware, in pseudocode form we can view the linear page table as an array that is indexed by a page number and that stores the corresponding block number. Line 2 extracts the page number, page, by extracting the leftmost 52 bits. (As explained in Sidebar 4.3, this book uses the big-endian convention for numbering bits and begins numbering with zero.) Then, it extracts the offset, the 12 rightmost bits of the virtual address (line 3). Line 4 reads the address of the active page map out of PMAR. Line 5 looks up the permissions for the page. If the permissions necessary for using virtual are a subset of the permissions for the page (line 6), then TRANSLATE looks up the corresponding block number by using page as an index into page_table and computes the physical address by concatenating block with offset (lines 7 and 8). The virtual memory manager then either issues the bus request for the translated physical address or interrupts the processor with an illegal memory reference exception.
There are two options for setting up page maps for the kernel program. The first option is to have each address space include a mapping of the kernel into its address space. For example, the top half of the address space might contain the kernel, in which case the bottom half contains the user program. With this setup, switching from the user program to the kernel (and vice versa) doesn’t require changing the processor’s page-map address register; only the user-mode bit must be changed. To protect the kernel, the kernel sets the permissions for kernel pages to KERNEL-ONLY; in user mode, performing a STORE to kernel pages is an illegal instruction. An additional advantage of this design is that in kernel mode, it is easy for the kernel to read data structures of the user program because the user program and kernel share the same address space. A disadvantage of this design is that it reduces the available address space for user programs, which could be a problem in a legacy architecture that has small address spaces.
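A one-line check captures the protection rule of this first option: with the kernel mapped in the top half of a 64-bit address space, a reference to a kernel page is legal only in kernel mode. The names and the exact half-and-half split are assumptions for illustration, not a particular processor’s layout.

#include <stdint.h>
#include <stdbool.h>

#define KERNEL_BASE (1ULL << 63)   /* kernel occupies the top half */

/* Pages at or above KERNEL_BASE are marked KERNEL-ONLY in the page map;
   everything below is governed by the ordinary page permissions. */
bool reference_allowed(uint64_t va, bool kernel_mode) {
    if (va >= KERNEL_BASE) return kernel_mode;
    return true;
}

int main(void) { return reference_allowed(KERNEL_BASE, false) ? 1 : 0; }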
The second option is for the memory manager to give the kernel its own separate address space, which is inaccessible to user-level threads. To implement this option, we must extend the SVC instruction to switch the page-map address register to the kernel’s page map when entering kernel mode. Similarly, when returning from kernel mode to user mode, the kernel must change the page-map address register to the page map of the thread that entered the gate.
The second option separates the kernel program and user programs completely, but the memory manager, which is part of the kernel, must be able to create new address spaces for user programs, and so on. The simple solution is to include the page tables of all user address spaces in the kernel address space. By modifying the page table for a user address space, the memory manager can modify that address space. Since a page table is smaller than the address space it defines, the second option wastes less address space than the first option.
If the kernel program and user programs have their own address spaces, the kernel cannot refer to data structures in user programs using kernel virtual addresses, since those virtual addresses refer to locations in the kernel address space. User programs must pass arguments to supervisor calls by value or the kernel must use a more involved method for copying data from a user address space to a kernel address space (and vice versa). For example, the kernel can compute the physical address for a user virtual address using the page table for that user address space, map the computed physical address into the kernel address space at an unused address, and then use that address.
In the design with many virtual address spaces, virtual addresses are relative to an address space. This property has the advantage that programs don’t have to be compiled to be position independent (see Sidebar 5.5). Every program can be stored at virtual address 0 and can use absolute addresses for making references to memory in its address space. In practice, this advantage is unimportant because it is not difficult for compiler designers to generate position-independent instructions.
Sidebar 5.5 Position-Independent Programs
Position-independent programs can be loaded at any memory address. To provide this feature, a compiler translating programs into processor instructions must generate relative, but not absolute, addresses. For example, when compiling a for loop, the compiler should use a jump instruction with an offset relative to the current PC to return to the top of a for loop rather than a jump instruction with an absolute address.
A disadvantage of the design with many address spaces is that sharing can be confusing and less flexible. It can be confusing because a block to be shared can be mapped by threads into two different address spaces at different virtual addresses.
It can be less flexible because either threads share a complete address space or a designer must accept a restriction on sharing. Threads in different address spaces can share objects only at the granularity of a block: if two threads in different address spaces share an object, that object must be mapped at a page boundary, and holding the object requires allocating an integral number of pages and blocks. If the shared object is smaller than a page, then part of the address space and the block will be wasted. Section 5.4.5 describes an advanced design that doesn’t have this restriction, but it is rarely used, since the waste isn’t a big problem in practice.
An ongoing debate between hardware and software designers concerns what parts of the virtual memory manager should be implemented in hardware as part of the processor and what parts in software as part of the operating system, as well as what the interface between the hardware module and the software module should be.
There is no “right” answer because the designers must make a trade-off between performance and flexibility. Because address translation is in the critical path of processor instructions that use addresses, the memory manager is often implemented as a digital circuit that is part of the main processor so that it can run at the speed of the processor. A complete hardware implementation, however, reduces the opportunities for the operating system to exploit the translation between virtual and physical addresses. This trade-off must be made with care when implementing the memory manager and its page table.
The page table is usually stored in the same memory as the data, reachable over the bus. This design requires that the processor make an additional bus reference to memory each time it interprets an address: the processor must first translate the virtual address into a physical address, which requires reading an entry in the page map.
To avoid these additional bus references for translating virtual to physical addresses, the processor typically maintains a cache of entries of the page map in a smaller fast memory within the processor itself. The hope is that when the processor executes the next instruction, it will discover a previously cached entry that can translate the address, without making a bus reference to the larger memory. Only when the cache memory doesn’t contain the appropriate entry must the processor retrieve an entry from memory. In practice, this design works well because most programs exhibit locality of reference. Thus, caching translation entries pays off, as we will see when we study caches in Chapter 6. Caching page table entries in the processor introduces new complexities: if a processor changes a page table entry, the cached versions must be updated too, or invalidated.
A final design issue is how to implement the cache memory of translations efficiently. A common approach is to use an associative memory instead of an indexed memory. By making the cache memory associative, any entry can store any translation. Furthermore, because the cache is much smaller than physical memory, an associative memory is feasible. In this design, the cache memory of translations is usually referred to as the translation look-aside buffer (TLB).
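The lookup/miss/invalidate flow of such a cache can be sketched in C. For brevity the sketch is direct-mapped rather than associative, unlike a real TLB, and all names are illustrative.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TLB_SLOTS 64    /* a small, fast memory inside the processor */

struct tlb_entry { uint64_t page; uint64_t block; bool valid; };
static struct tlb_entry tlb[TLB_SLOTS];

/* On a hit, the translation is found without a bus reference to memory. */
bool tlb_lookup(uint64_t page, uint64_t *block) {
    struct tlb_entry *e = &tlb[page % TLB_SLOTS];
    if (e->valid && e->page == page) { *block = e->block; return true; }
    return false;
}

/* On a miss, the page map is consulted and the translation is cached. */
void tlb_insert(uint64_t page, uint64_t block) {
    struct tlb_entry *e = &tlb[page % TLB_SLOTS];
    e->page = page; e->block = block; e->valid = true;
}

/* When a page-map entry changes, the cached copy must be invalidated. */
void tlb_invalidate(uint64_t page) {
    struct tlb_entry *e = &tlb[page % TLB_SLOTS];
    if (e->valid && e->page == page) e->valid = false;
}

int main(void) {
    uint64_t block;
    tlb_insert(12, 100);
    if (tlb_lookup(12, &block))
        printf("page 12 -> block %llu\n", (unsigned long long)block);
    return 0;
}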
In the hardware design of Figure 5.19, the format of the page table is determined by the hardware. RISC processors typically don’t fix the format of the page table in hardware but leave the choice of data structure to software. In these RISC designs, only the TLB is implemented in hardware. When a translation is not in the TLB, the processor generates a TLB miss exception. The handler for this exception looks up the mapping in a data structure implemented in software, inserts the translation in the TLB, and returns from the exception. With this design, the memory manager has complete freedom in choosing the data structure for the page map. If a module uses only a few pages, a designer may be able to save memory by storing the page map as a linked list or tree of pages. If, as is common, the union of the virtual address spaces is much larger than the physical memory, a designer may be able to save memory by inverting the page map and storing one entry per physical memory block; the contents of the entry identify the number of the page currently in the block.
In almost all designs of virtual addresses, the operating system manages the content of the page map in software. The hardware design may dictate the format of the table, but the kernel determines the values stored in the table entries and thus the mapping from virtual addresses to physical addresses. By allowing software to control the mapping, designers open up many uses of virtual addresses. One use is to manage physical memory efficiently, avoiding problems due to fragmentation. Another use is to extend physical memory by allowing pages to be stored on other devices, such as magnetic disks, as explained in Section 6.2.
An address space per program (as in Figure 5.21) limits the way objects can be shared between threads. An alternative way is to use segments, which provide each object with a virtual address space starting at 0 and ending at the size of the object. In the segment approach, a large database program may have a segment for each table in a database (assuming the table isn’t larger than a segment’s address space). This allows threads to share memory at the granularity of objects instead of blocks, and in a flexible manner. A thread can share one object (segment) with one thread and another object (segment) with another thread.
To support segments, the processor must be modified because the addresses that programs use are really two numbers. The first identifies the segment number, and the second identifies the address within that segment. Unlike the model that has one virtual address space per program, where the programmer is unaware that the virtual address is implemented as a page number and an offset, in the segment model, the compiler and programmer must be aware that an address contains two parts. The programmer must specify which segment to use for an instruction, the compiler must put the generated code in the right segment, and so on. Problem set 11 explores segments with a simple operating system and a processor with minimal support for segments.
In the address-space-per-program design, a thread can do arithmetic on addresses because the program's address space is linear. In the segment model, a thread cannot do arithmetic on addresses in different segments because adding to a segment number yields a meaningless result; there is no notion of contiguity for segment numbers.
If two threads share an object, they typically use the same segment number for the object; otherwise naming shared objects becomes cumbersome too.
Segments can be implemented by reintroducing slightly modified domain registers. Each segment has its own domain register, but we add a page_table field to the domain register. This page_table field contains the physical address of the page table that should be used to translate virtual addresses of the segment. When domain registers are used in this way, the literature calls them segment descriptors. Using this implementation, the memory manager translates an address as follows: the memory manager uses the segment number to look up the segment descriptor and uses the page_table in the segment descriptor to translate the virtual address within the segment to a physical address.
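A sketch of that translation path in C follows; the descriptor layout, the illegal_address handler, and the flat page-table array are assumptions made only for illustration, not a description of any particular hardware.

#include <stdint.h>

#define PAGE_SIZE  4096u
#define N_SEGMENTS 64

struct segment_descriptor {
    uint32_t *page_table;    /* page table that translates this segment's addresses */
    uint32_t  length;        /* segment length in bytes, for bounds checks */
};

static struct segment_descriptor descriptor_table[N_SEGMENTS];

extern void illegal_address(uint32_t segment, uint32_t offset);  /* assumed not to return */

/* Translate the two-part address (segment, offset) to a physical address. */
uint32_t translate(uint32_t segment, uint32_t offset)
{
    if (segment >= N_SEGMENTS)
        illegal_address(segment, offset);
    struct segment_descriptor *d = &descriptor_table[segment];
    if (offset >= d->length)
        illegal_address(segment, offset);
    uint32_t page  = offset / PAGE_SIZE;      /* page within the segment */
    uint32_t block = d->page_table[page];     /* physical block number */
    return block * PAGE_SIZE + offset % PAGE_SIZE;
}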
Giving each object of an application its own segment potentially requires a large number of segment descriptors per processor. We can solve this problem by putting the segment descriptors in memory in a segment descriptor table and giving each processor a single register that points to the segment descriptor table.
An advantage of the segment model is that the designer doesn’t have to predict the maximum size of objects that grow dynamically during computation. For example, as the stack of a running computation grows, the virtual memory manager can allocate more pages on demand and increase the length of the stack segment. In the address space per program model, the thread’s stack may grow into another data structure in the virtual address space. Then either the virtual memory manager must raise an error, or the complete stack must be moved to a place in the address space that has a large enough range of unused contiguous addresses.
The programming model that goes with a segment per object can be a good match for new programs written in an object-oriented style: the methods of an object class can be in a segment with READ and EXECUTE permissions, the data objects of an instance of that class in a segment with READ and WRITE permissions, and so on. Porting an old program to the segment model can be easy if one stores the complete program, code, and data in a single segment, but this method loses much of the advantage of the segment model because the entire segment must have READ, WRITE, and EXECUTE permission. Restructuring an old program to take advantage of multiple segments can be challenging because addresses are not linear; the programmer must modify the old program to specify which segment to use. For example, upgrading a kernel program to take advantage of segments in its internal construction is disruptive. A number of processors and kernels tried but failed (see Section 5.7).
Although virtual memory systems supporting segments have advantages and have been influential (see, for example, the Multics virtual memory design [Suggestions for Further Reading 5.4.1]), most virtual memory systems today follow the address space per program approach instead of the segment approach. A few processors, such as the Intel x86 (see Section 5.7), have support for segments, but today's virtual memory systems don't exploit them. Virtual memory managers for the address space per program model tend to be less complex because sharing is not usually a primary requirement. Designers view an address space per program primarily as a method for achieving enforced modularity rather than as an approach to sharing. Sharing pages between programs is possible, but only in a limited way, and it isn't the primary goal. Furthermore, porting an old application to the one address space per program model requires little effort at the outset: just allocate a complete address space for the application. If any sharing is necessary, it can be done later. In practice, sharing patterns tend to be simple, so no sophisticated support is necessary. Finally, today, address spaces are usually large enough that a program doesn't need an address space per object.
In order to focus on one new idea at a time, the previous sections assumed that a separate processor was available to run each thread. Because there are usually not enough processors to go around, this section extends the thread manager to remove the assumption that each thread has its own processor. This extended thread manager shares a limited number of processors among a larger number of threads.
Sharing of processors introduces a new concern: if a thread hogs a processor, either accidentally or intentionally, it can slow down or even halt the progress of other threads, thereby compromising modularity. Because we have proposed that a primary requirement be to enforce modularity, one of the design challenges of the thread manager is to eliminate this concern.
This section starts with the design of a simple thread manager that does not avoid the hogging of a processor, and then moves to a design that does. It makes the design concrete by providing a pseudocode implementation of a thread manager. This implementation captures the essence of a thread manager. In practice, thread managers differ in many details and sometimes are much more complex than our example implementation.
Recall from Section 5.1.1 that a thread is an abstraction that encapsulates the state of a running module. A thread encapsulates enough of the state of the interpreter (e.g., a processor) executing it that a thread manager can stop the thread and resume it later. A thread manager animates a thread by giving it a processor. This section explains how this ability to stop a thread and later resume it can be used to multiplex many threads over a limited number of physical processors.
To make the thread abstraction concrete, the thread manager might support this simple version of an ALLOCATE_THREAD procedure:
thread_id ← ALLOCATE_THREAD (starting_procedure, address_space_id): allocate a new thread in address space address_space_id. The new thread is to begin with a call to the procedure specified in the argument starting_procedure. ALLOCATE_THREAD returns an identifier that names the just-created thread. If the thread manager cannot allocate a new thread (e.g., it doesn’t have enough free memory to allocate a new stack), ALLOCATE_THREAD returns an error.
The thread manager implements ALLOCATE_THREAD as follows: it allocates a range of memory in address_space_id to be used as the stack for procedure calls, selects a processor, and sets the processor’s PC to the address starting_procedure in address_space_id and the processor’s SP to the bottom of the allocated stack.
Using ALLOCATE_THREAD, an application can create more threads than there are processors. Consider the applications running on the computer in Figure 5.4. These applications can have more threads than there are processors; for example, the file service might launch a new thread for each new client request. Starting a new module will also create additional threads. So the problem is to share a limited number of processors among potentially many threads.
We can solve this problem by observing that most threads spend much of their time waiting for events to happen. Most modules that run on the computer in Figure 5.4 call READ for input from the keyboard, the file system, or the network, and will wait by spinning until there is input. Instead of spinning, the thread could, while it is waiting, let another thread use its processor. If the consumer thread finds that it cannot proceed (because the buffer is empty), then it could release its processor, giving the keyboard manager (or any other thread) a chance to run.
This observation is the basic idea for virtualizing the processor: when a thread is waiting for an event, its processor can switch from that thread to another one by saving the state of the waiting thread and loading the state of a different thread. Since in most system designs many threads spend much of their life waiting for a condition to become true, this idea is general. For example, most of the other modules (the window manager, the mail reader, etc.) are consumers. They spend much of their existence waiting for input to arrive.
Over the years people have developed various labels for processor virtualization schemes, such as “time-sharing”, “processor multiplexing”, “multiprogramming”, or “multitasking”. For example, the word “time-sharing” was introduced in the 1950s to describe virtualization of a computer system so that it could be shared among several interactive human users. All these schemes boil down to the same idea: virtualizing the processor, which this section describes in detail.
To make the discussion more concrete, consider the implementation of SEND and RECEIVE with a bounded buffer in Figure 5.6. This spin-loop solution is appropriate if there is a processor for each thread, but it is inappropriate if there are fewer processors than threads. If there is just one processor and if the receiver started before the sender, then we have a major problem. The receiver thread executes its spinning loop, and the sender never gets a chance to run and add an item to the buffer.
A solution with thread switching is shown in Figure 5.22. Comparing this code with the code in Figure 5.6, we find that the only change is the addition of two calls to a procedure named YIELD (lines 10 and 20). YIELD is an entry to the thread manager. When a thread invokes YIELD, the thread manager gives the calling thread's processor to some other thread. At some time in the future, the thread manager returns a processor to this thread by returning from the call to YIELD. In the case of the receiver, when the processor returns at line 21, the receiving thread reacquires the lock and tests out = in. If out is now less than in, there is at least one new item in the buffer, so the thread extracts an item from the buffer. If not, the thread releases the lock and calls YIELD again to allow another thread to run. A thread therefore alternates between two states, which we name RUNNING (executing on a processor) and RUNNABLE (ready to run but waiting for a processor to become available).
Figure 5.22 An implementation of a virtual communication link for a system with more threads than processors.
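Figure 5.22 itself is not reproduced on this page, but a minimal sketch in C of its receiver side conveys the idea, assuming the bounded buffer of Figure 5.6; the lower-case names acquire, release, and yield are stand-ins for the book's ACQUIRE, RELEASE, and YIELD.

#define N 10                          /* buffer capacity, as in Figure 5.6 */

struct buffer {
    long in, out;                     /* counts of messages sent and received */
    int  messages[N];
    int  lock;
};

extern void acquire(int *lock);
extern void release(int *lock);
extern void yield(void);              /* give this processor to another thread */

int receive(struct buffer *p)
{
    for (;;) {
        acquire(&p->lock);
        if (p->out < p->in) {         /* at least one new item in the buffer? */
            int m = p->messages[p->out % N];
            p->out += 1;
            release(&p->lock);
            return m;
        }
        release(&p->lock);            /* buffer empty: release the lock and */
        yield();                      /* let another thread use the processor */
    }
}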
The job of YIELD is to switch a processor from one thread to another. In its essence, YIELD is a simple three-step operation:
1. Save this thread's state so that it can resume later.
2. Schedule another thread to run.
3. Dispatch this processor to that thread, by loading its saved state and continuing where it left off.
The concrete problem that YIELD solves is multiplexing many threads over a potentially smaller number of processors (see Figure 5.23). Each processor typically has an identifier (ID), a stack pointer (SP), a program counter (PC), and a page-map address register (PMAR), pointing to the page map that defines the thread’s address space. Processors may have additional state, such as floating point registers. Each thread has virtual versions of ID, SP, PC, and PMAR, and any additional state. YIELD must multiplex perhaps many threads in the thread layer over a limited number of processors in the processor layer.
Figure 5.23 Multiplexing m processors among n threads (n > m).
YIELD implements the multiplexing as follows. When a thread running in the thread layer calls YIELD, YIELD enters the processor layer. The processor saves the state of the thread that is currently running. When that processor later exits from the processor layer, it runs a new thread, usually one that is different from the one it was running when it entered. This new thread may run in a different address space from the one used by the thread that called YIELD, or it may run in the same address space, depending on how the two threads were originally allocated.
Because the implementation of YIELD is specific to a processor and must load and save state that is often stored in processor registers (i.e., SP, PC, PMAR), YIELD is written in the instruction repertoire of the processor and can be thought of as a software extension of the processor. Programs using YIELD may be written in any programming language, but the implementation of YIELD itself is usually written in low-level processor-specific instructions. YIELD is typically a kernel procedure reached by a supervisor call.
Using this layering picture, we can also explain how interrupts and exceptions are multiplexed with threads. Interrupts invoke an interrupt handler, which always runs in the processor layer, even if the interrupt occurs while in the thread layer. On an interrupt, the interrupted processor runs the corresponding interrupt handler (e.g., when a clock interrupt occurs, it runs the clock handler) and then continues with the thread that the processor was running before the interrupt.
Exceptions happen in the thread layer. That is, the exception handler runs in the context of the interrupted thread; it has access to the interrupted thread’s state and can invoke procedures on behalf of the interrupted thread.
As discussed in Sidebar 5.6, the literature is inconsistent both about the labels and about the distinction between the concepts of interrupts and exceptions. For purposes of this text, we define interrupts as events that may have no relation to the currently running thread, whereas exceptions are events that specifically pertain to the currently running thread. While both exceptions and interrupts are discovered by the processor, interrupts are handled by the processor layer, and exceptions are usually referred to a handler in the thread layer.
Sidebar 5.6 Interrupts, Exceptions, Faults, Traps, and Signals
The systems and architecture literature uses the words “interrupt” and “exception” inconsistently, and some authors use different words, such as “fault”, “trap”, “signal”, and “sequence break”. Some designers call a particular event an interrupt, while another designer calls the same event an exception or a signal. An operating system designer may label the handler for a hardware interrupt as an exception, trap, or fault handler. Terminology questions also arise because an interrupt handler in the operating system may invoke a thread’s exception handler, which raises the question of whether the original event is an interrupt or an exception. The layered model helps answer this question: at the processor layer the event is an interrupt, and at the thread layer it is an exception.
This difference places restrictions on what code can run in an interrupt handler: in general, an interrupt handler shouldn’t invoke procedures (e.g., YIELD) of the thread layer that assume that the thread is running on the current processor because the interrupt may have nothing to do with the currently running thread. An interrupt handler can invoke an exception handler in the thread layer if the handler determines that this interrupt pertains to the thread running on the interrupted processor. The exception handler can then adjust the thread’s environment. We will see an example of this case in Section 5.5.4 when the thread manager uses a clock interrupt to force the currently running thread to invoke YIELD.
Although the essence of multiplexing is simple, the code implementing YIELD is often among the most mysterious in an operating system. To dispel the mysteries, Section 5.5.2 develops a simple implementation of a thread manager that supports YIELD. Section 5.5.3 describes how this implementation can be extended to support creation and termination of threads. Section 5.5.4 explains how an operating system can enforce modularity among threads using interrupts, and Section 5.5.5 adds enforcement of separate address spaces. Section 5.5.6 explains how systems use multiplexing recursively to implement several layers of processor virtualization.
To keep the implementation of YIELD as simple as possible, let’s temporarily restrict its implementation to a fixed number of threads, say, seven, and assume there are fewer than seven processors. (If there were seven or more processors and only seven threads, then processor virtualization would be unnecessary.) We further start by assuming that all seven threads run in the same address space, so we don’t have to worry about saving and restoring a thread’s PMAR. Finally, we will assume that the threads are already running. (Section 5.5.3 will remove the last assumption, explaining how threads are created and how the thread manager starts.)
With these assumptions we can implement YIELD as shown in Figure 5.24. The implementation of YIELD relies on four procedures: GET_THREAD_ID, ENTER_PROCESSOR_LAYER, EXIT_PROCESSOR_LAYER, and SCHEDULER. Each procedure has only a few lines of code, but they are subtle; we investigate them in detail.
Figure 5.24 An implementation of YIELD. EXIT_PROCESSOR_LAYER will return to YIELD because EXIT_PROCESSOR_LAYER uses the SP that was saved in ENTER_PROCESSOR_LAYER. To make the control flow easier to follow, the procedures have explicit return statements.
As shown in the figure, the code for the procedures maintains two shared arrays: an array with one entry per processor, known as the processor_table, and an array with one entry per thread, known as the thread_table. The processor_table array records information for each processor. In this simple implementation, the information is just the identity of the thread that the processor is currently running. In later versions, we will need to keep track of more information. To be able to index into this table, a processor needs to know what its identity is, which is usually stored in a special register CPUID. That is, the procedure GET_THREAD_ID returns the identity of the thread running on processor CPUID (line 7). The procedure GET_THREAD_ID virtualizes the register CPUID to create a virtual ID register for each thread, which records a thread’s identity.
Entry i of thread_table holds the stack pointer for thread i (whenever thread i is not actually running on a processor) and records whether thread i is RUNNING (i.e., a processor is running thread i) or RUNNABLE (i.e., thread i is waiting to receive a processor). In a system with n processors, n threads can be in the RUNNING state at the same time.
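In C, the two shared arrays and GET_THREAD_ID might be declared along these lines; the field names, the table sizes, and the cpuid() stand-in for reading the CPUID register are assumptions of this sketch.

enum thread_state { RUNNING, RUNNABLE };

struct processor_entry {
    int thread_id;                    /* thread this processor is currently running */
};

struct thread_entry {
    void *sp;                         /* saved stack pointer, valid unless RUNNING */
    enum thread_state state;
};

#define N_PROCESSORS 2
#define N_THREADS    7

static struct processor_entry processor_table[N_PROCESSORS];
static struct thread_entry    thread_table[N_THREADS];

extern int cpuid(void);               /* stand-in for reading register CPUID */

/* GET_THREAD_ID: the identity of the thread running on this processor. */
int get_thread_id(void)
{
    return processor_table[cpuid()].thread_id;
}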
With these data structures, the processor layer works as follows. Suppose that two processors, A and B, are busy running seven threads and that thread 0, which is running on processor A, calls YIELD. YIELD acquires thread_table_lock at line 9 so that the processor layer can implement switching threads as a before-or-after action. (The lock is needed because there is more than one processor, so different threads might try to invoke YIELD at the same time.) YIELD then calls ENTER_PROCESSOR_LAYER to release its processor.
The statement on line 14 records that the calling thread will no longer be running on the processor but that it is runnable. That is, if there are no other threads waiting to run, the processor layer can schedule thread 0 again.
Line 15 saves thread 0's stack pointer (held in processor A's SP register) into thread 0's entry in thread_table. The stack pointer is the only thread state that must be saved because the processor layer always suspends a thread in ENTER_PROCESSOR_LAYER; it is unnecessary to save and restore the program counter. We are assuming that all threads run in the same address space, so PMAR doesn't have to be saved and restored either. Other processors, other calling conventions, or designs in which a thread may resume at a different address than the one in ENTER_PROCESSOR_LAYER might require that ENTER_PROCESSOR_LAYER save additional thread state. In that case, the thread_table entries must have additional fields, and ENTER_PROCESSOR_LAYER would save the additional state in those extra fields.
The part of the processor layer that chooses the next thread is called the scheduler. In our simple implementation, statements on lines 20 through 22 constitute the core of the scheduler. Processor A cycles through the thread table, skips threads that are already running on another processor, stops searching when it finds a runnable thread (let’s say thread 6), and sets thread 6’s state to RUNNING (line 23) so that another processor doesn’t again select thread 6. This implementation schedules threads in a round-robin fashion, but many other policies are possible; we discuss some others in Chapter 6 (Section 6.3).
This implementation of the processor layer assumes that there are at least as many threads as processors. Under this assumption, processor A will select and run a thread different from the one that called YIELD, unless the number of threads equals the number of processors, in which case processor A will cycle back to the thread that called YIELD because all the other threads are running on other processors. If there are fewer threads than processors, this implementation leaves processor A cycling forever through thread_table without giving up thread_table_lock, preventing any other thread from calling YIELD. We will fix this problem in Section 5.5.3, where we introduce a version of YIELD that supports the dynamic creation and termination of threads.
After selecting thread 6 to run, the processor records that thread 6 is running on this processor (line 24), so that on the next call to ENTER_PROCESSOR_LAYER the processor knows which thread it is running. The processor leaves the processor layer by calling EXIT_PROCESSOR_LAYER, which dispatches processor A to thread 6. This part of the thread manager is often called the dispatcher.
The procedure EXIT_PROCESSOR_LAYER loads the saved stack pointer of thread 6 into processor A’s SP register (line 30). (In implementations that have additional thread state that must be restored, these lines would need to be expanded.) Now processor A is running thread 6.
Because line 30 replaces SP with the value that thread 6 saved on line 15 when it last ran, the flow of control when the processor reaches the return on line 31 requires some thought. That return pops a return address off the stack. The return address is the address that thread 6 pushed on its stack when it called ENTER_PROCESSOR_LAYER at line 10. Thus, the line 31 return actually goes to the caller of ENTER_PROCESSOR_LAYER, namely, YIELD, at line 11. Line 12 pops the next return address off the stack, returning control to the program in thread 6 that originally called YIELD. The overall effect is that thread 0 called YIELD, but control returns to the instruction after the call to YIELD in thread 6.
This flow of control has the curious effect of abandoning two stack frames, the ones allocated on the calls to SCHEDULER and EXIT_PROCESSOR_LAYER. The original save of SP in thread 6 at line 15 actually accomplished two goals: (1) save the value of SP for future use when control returns to thread 6, and (2) mark a place that the processor layer thread can use as a stack when executing SCHEDULER and EXIT_PROCESSOR_LAYER. The reloading of SP at line 30 similarly accomplishes two goals: (1) restore the thread 6 stack and (2) abandon the processor layer stack, which is no longer needed. A more elaborate thread manager design, as we will see in Section 5.5.3, switches to a separate processor layer stack rather than borrowing space atop an existing thread layer stack.
To understand why this implementation of YIELD works, consider two threads: one running the SEND procedure of Figure 5.22 and one running the RECEIVE procedure. Furthermore, assume that the sender thread is thread 0, that the receiver thread is thread 6, and that the data and instructions of the procedures are located at address 1001 and up in memory. Finally, assume the following saved thread state for thread 0 and the following current state for processor A:

thread_table[0]: SP = 100, state = RUNNABLE
processor A: running thread 6, SP = top of thread 6's stack (204), PC = 1009
At some time in the past, thread 0 called YIELD and ENTER_PROCESSOR_LAYER stored the value of thread 0’s stack pointer (100) into the thread table and went on to run some other thread. Processor A is currently running thread 6: A’s entry in the processor_table array contains 6, A’s SP register points to the top of the stack of thread 6, and A’s PC register contains address 1009, which holds the first instruction of YIELD (see line 9).
YIELD invokes the procedure ENTER_PROCESSOR_LAYER, following the procedure call convention of Section 4.1.1, which pushes some values on thread 6's stack (in particular, the return address, 1011) and changes A's SP to 220 (204 + 16). ENTER_PROCESSOR_LAYER knows that the current thread has index 6 by reading the processor's entry in the processor_table array. Line 15 saves thread 6's current top of stack (220) by storing processor A's SP into thread 6's entry in thread_table.
The statements at lines 19 through 22 choose which thread to run next, using a simple round-robin algorithm, and select thread 0. The scheduler invokes EXIT_PROCESSOR_LAYER to dispatch processor A to thread 0.
Line 30 loads the saved SP of thread 0 so that processor A can find the top of the stack at memory address 100. At the top of thread 0's stack will be the return address; this address will be 1011 (the line after the call to ENTER_PROCESSOR_LAYER in YIELD, line 11), since thread 0 entered ENTER_PROCESSOR_LAYER from YIELD. Thread 0 releases thread_table_lock so that another thread can enter ENTER_PROCESSOR_LAYER and return from YIELD. Thread 0 returns from EXIT_PROCESSOR_LAYER following the procedure call convention, which pops the return address off the top of the stack. The address that EXIT_PROCESSOR_LAYER uses is 1011 because EXIT_PROCESSOR_LAYER uses the SP saved by ENTER_PROCESSOR_LAYER and thus returns to YIELD at line 11. YIELD releases the thread_table_lock and returns control to the program in thread 0 that originally called YIELD.
At this point, the thread switch has completed, and thread 0, rather than thread 6, is running on processor A; the state is as follows:

thread_table[6]: SP = 220, state = RUNNABLE
processor A: running thread 0
At some time in the future, the thread manager will resume thread 6, at the instruction at address 1011.
From this example we can see that a thread always releases its processor by calling ENTER_PROCESSOR_LAYER and that the thread always resumes right after the call to ENTER_PROCESSOR_LAYER. This stylized flow of control, in which a thread always releases its processor at the same point and resumes at that point, is an example of what is sometimes called a co-routine.
To ensure that the thread switch is atomic, the thread that invokes ENTER_PROCESSOR_LAYER acquires thread_table_lock and the thread that resumes using EXIT_PROCESSOR_LAYER releases thread_table_lock (line 11). Because the scheduler is likely to choose a different thread to run from the one that called YIELD, the thread that releases the lock is most likely a different thread from the one that acquired the lock. In essence, the thread that releases the processor passes the lock along to the thread that next receives the processor.
Thread switching relies on a detailed understanding of the processor and the procedure call convention. In most systems, the implementation of thread switching is more complex than the implementation in Figure 5.24 because we made several assumptions that often don’t hold in real systems: there is a fixed number of threads, all threads are runnable, and scheduling threads round-robin is an acceptable policy. In the next sections, we will eliminate some of these assumptions.
The example YIELD procedure supports only a fixed number of threads. A full-blown thread manager allows threads to be created and terminated on demand. To support a variable number of threads, we would need to modify the implementation of ALLOCATE_THREAD and extend the thread manager with the following procedures:
EXIT_THREAD (): destroy and clean up the calling thread. When a thread is done with its job, it invokes EXIT_THREAD to release its state.
DESTROY_THREAD (id): destroy the thread identified by id. In some cases, one thread may need to terminate another thread. For example, a user may have started a thread that turns out to have a programming error such as an endless loop, and thus the user wants to terminate it. For these cases, we might want to provide a procedure to destroy a thread.
For the most part, the implementation of these procedures is relatively straightforward, but there are a few subtle issues. For example, if threads can terminate, we have to fix the problem that the previous code required at least as many threads as processors. To get at these issues, we detail their implementation. First, we create a separate thread for each processor (which we will call a processor-layer thread, or processor thread for short), which runs the procedure SCHEDULER (see Figure 5.25). The way to think about this setup is that the SCHEDULER runs in the processor layer, and it virtualizes its processor. A processor thread per processor is necessary because a thread in the thread layer (a thread-layer thread) cannot deallocate its own stack since it cannot call a procedure (e.g., DEALLOCATE or YIELD) on a stack that it has released. Instead, we set it up so that the processor-layer thread cleans up thread-layer threads. When starting the operating system kernel (e.g., after turning the computer on), the kernel creates processor-layer threads as follows:
Figure 5.25 YIELD with support for dynamic thread creation and deletion. Control flow is not obvious because some of those procedures reload SP, which changes the place to which they return. To make it easier to follow, the procedures have explicit return statements. The procedure called on line 12 actually returns by passing control to line 24, and the procedure called on line 23 actually returns by passing control to line 13. Figure 5.26 shows the control flow graphically.
procedure RUN_PROCESSORS ()
    for each processor do
        allocate stack and set up a processor thread
    shutdown ← FALSE
    SCHEDULER ()
    deallocate processor thread stack
    halt processor
This procedure allocates a stack and sets up a processor thread for each processor. This thread runs the scheduler procedure until some procedure sets the global variable shutdown to TRUE. Then, the computer restarts or halts.
We first revisit YIELD with this setup, and we then see how this generalization supports thread creation and deletion. Using a separate processor thread, we find that switching a processor from one thread-layer thread to another actually requires two thread switches: one from the thread that is releasing its processor to the processor thread and one from the processor thread to the thread that is to receive the processor (see Figure 5.26). In more detail, let’s suppose, as before, that thread 0 calls YIELD on processor A and that thread 6 is runnable and has called YIELD earlier. Thread 0 switches to the processor thread by invoking ENTER_PROCESSOR_LAYER (line 12). The implementation of ENTER_PROCESSOR_LAYER is almost identical to ENTER_PROCESSOR_LAYER of Figure 5.24: it saves the stack pointer in the calling thread’s thread_table entry, but it loads a new stack pointer from CPUID’s processor_table entry. When ENTER_PROCESSOR_LAYER returns, it will switch to the processor thread and resume at line 24 (right after EXIT_PROCESSOR_LAYER).
Figure 5.26 Control flow example when thread 0 yields to thread 6.
The processor thread will cycle through the thread table until it hits thread 6, which is runnable. The SCHEDULER sets thread 6’s state to RUNNING (line 21), records that thread 6 will run on this processor (line 22), and invokes EXIT_PROCESSOR_LAYER, to switch the processor to thread 6 (line 23). EXIT_PROCESSOR_LAYER saves the scheduler’s thread state into CPUID’s entry in the processor_table and loads thread 6’s state in the processor. Because line 37 of EXIT_PROCESSOR_LAYER has loaded SP, the return statement at line 38 acts like a return from the procedure that saved SP. That procedure was ENTER_PROCESSOR_LAYER at line 33, so control passes to the caller of ENTER_PROCESSOR_LAYER, namely, YIELD, at line 13. YIELD releases thread_table_lock and returns control to the program of thread 6 that originally called it.
With this setup of thread switching in place, we can return to creating and deallocating threads dynamically. To keep track of which thread_table entries are in use, we extend the set of possible states of each entry with the additional state FREE. Now we can implement ALLOCATE_THREAD as follows:
1. Allocate space in memory for a new stack.
2. Place on the new stack an empty frame containing just a return address and initialize that return address with the address of EXIT_THREAD.
3. Place on the stack a second empty frame containing just a return address and initialize this return address with the address of starting_procedure.
4. Find an entry in the thread table that is FREE and initialize it for the new thread by storing the top of the new stack.
If the thread manager cannot complete these steps (e.g., all entries in the thread table are in use), then ALLOCATE_THREAD returns an error.
To illustrate this implementation, consider the state of a newly created thread 1: its stack is located at address 292, and its saved stack pointer is 300. With this initial setup, it appears that EXIT_THREAD called the procedure starting_procedure and that thread 1 is about to return to that procedure. Thus, when SCHEDULER selects this thread, its return statement will go to starting_procedure. In detail, when the scheduler selects the new thread (1) as the next thread to execute, it sets its stack pointer to the top of the new stack (300) in EXIT_PROCESSOR_LAYER. When the processor returns from EXIT_PROCESSOR_LAYER, it sets its program counter to the address on top of the stack (the address of starting_procedure) and starts execution at that location. The procedure starting_procedure releases thread_table_lock, and the new thread is running.
With this initial setup, when a thread finishes the procedure starting_procedure, it returns using the standard procedure return convention. Since ALLOCATE_THREAD put the address of the EXIT_THREAD procedure on the stack, this return transfers control to the first instruction of EXIT_THREAD.
The EXIT_THREAD procedure can be implemented as follows:
procedure EXIT_THREAD ()
    ACQUIRE (thread_table_lock)
    thread_table[tid].kill_or_continue ← KILL
    ENTER_PROCESSOR_LAYER (GET_THREAD_ID (), CPUID)
EXIT_THREAD sets the kill_or_continue variable for the calling thread and invokes ENTER_PROCESSOR_LAYER, which switches the processor to the processor thread. The processor thread checks the kill_or_continue variable on line 24 to see if a thread is done and, if so, marks the thread's entry as reusable (line 25) and deallocates its stack (line 26). Since no thread is using that stack, it is safe to deallocate it.
The implementation of DESTROY_THREAD is also a bit tricky because the target thread to be destroyed might be running on one of the processors. Thus, the calling thread cannot just free the target thread’s stack; the processor running the target thread must do that. We can achieve that in an indirect way. DESTROY_THREAD just sets the kill_or_continue variable of the target thread to KILL and returns. When a thread invokes YIELD and enters the processor layer, the processor thread will check this variable and release the thread’s resources. (Section 5.5.4 will show how to ensure that each thread running on a processor will call YIELD at least occasionally.)
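The indirection is small enough to show in full. This sketch assumes the kill_or_continue field used by EXIT_THREAD above; all other fields of the entry are elided.

enum kill_state { CONTINUE, KILL };

struct thread_entry { enum kill_state kill_or_continue; /* ...other fields... */ };
extern struct thread_entry thread_table[];

extern int  thread_table_lock;
extern void acquire(int *lock);
extern void release(int *lock);

/* Mark the target thread; the processor thread will notice the mark on
   the target's next entry to the processor layer and release the
   thread's stack and table entry. */
void destroy_thread(int id)
{
    acquire(&thread_table_lock);
    thread_table[id].kill_or_continue = KILL;
    release(&thread_table_lock);
}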
The implementation described for allocating and deallocating threads is just one of many ways of handling the creation and destruction of threads. If one opens up the internals of half a dozen different thread packages, one will find half a dozen quite different ways to handle launching and terminating threads. The goal of this section was not to exhibit a complete catalog, but rather, by illustrating one example in detail, to dispel any mystery and expose the main issues that every implementation must address. Problem set 10 explores an implementation of a thread package in a trivial operating system for a single processor and two threads.
The thread manager described so far switches to a new thread only when a thread calls YIELD. This scheduling policy, where a thread continues to run until it gives up its processor, is called non-preemptive scheduling. It can be problematic because the length of time a thread holds its processor is entirely under the control of the thread itself. If, for example, a programming error sends one thread into an endless loop, no other thread will ever be able to use that processor again. Non-preemptive scheduling might be acceptable for a single module that has several threads (e.g., a Web server that has several threads to increase performance) but not for several modules.
Some systems partially address this problem by having a gentlemen’s agreement called cooperative scheduling (which in the literature sometimes is called cooperative multitasking): every thread is supposed to call YIELD periodically, for instance, once per 100 milliseconds. This solution is not robust because it relies on modules behaving well and not having any errors. If a programmer forgets to put in a YIELD, or if the program accidentally gets into an endless loop that does not include a YIELD, that processor will no longer participate in the gentlemen’s agreement. If, as is common with cooperative multitasking designs, there is only a single processor, the processor may appear to freeze, since the other threads won’t have an opportunity to make progress.
To enforce modularity among multiple threads, the operating system thread manager must ensure thread switching by using what is called preemptive scheduling: the thread manager forces a thread to give up its processor after, for example, 100 milliseconds. The thread manager can implement preemptive scheduling by setting the interval timer of a clock device. When the timer expires, the clock triggers an interrupt, switching to kernel mode in the processor layer. The clock interrupt handler can then invoke an exception handler, which runs in the thread layer and forces the currently running thread to yield. Thus, if a thread is in an endless loop, it runs for 100 milliseconds when its turn comes, but it cannot stop other threads from getting at least some use of the processor, too.
Supporting preemptive scheduling requires some changes to the thread manager because in the implementation described so far an interrupt handler shouldn’t invoke procedures in the thread layer at all, not even when the interrupt pertains to the currently running thread. To see why, consider a processor that invokes an interrupt handler that calls YIELD. If the interrupt happens right after the thread on that processor has acquired thread_table_lock in YIELD, then we will create a deadlock. The YIELD call in the handler will try to acquire thread_table_lock too, but it already has been acquired by the interrupted thread. That thread cannot continue and release the lock because it has been interrupted by the handler.
The problem is that we have concurrent activity within the processor layer (see Figure 5.23): the thread manager (i.e., YIELD) and the interrupt handler. The concurrent execution within the thread layer is coordinated with locks, but the processor needs its own mechanism. The processor may stop processing instructions of a thread at any time and switch to processing interrupt instructions. We are lacking a mechanism to turn the processor instruction stream and the interrupt instruction stream into separate before-or-after actions.
One solution to prevent the interrupt instruction stream from interfering with the processor instruction stream is to enable/disable interrupts. Disabling interrupts for a region greater than the region in which the thread_table_lock is set ensures that both streams are separate before-or-after actions. When a thread is about to acquire the thread_table_lock, it also disables interrupts on its processor. Now the processor will not switch to an interrupt handler when an interrupt arrives; interrupts are delayed until they are enabled again. After the thread has released the thread_table_lock, it is safe to reenable interrupts. Any pending interrupts will then execute immediately, but it is now safe since no thread on this processor can be inside the thread manager. This solution avoids the deadlock problem. For a more detailed description of the challenges and the solution in the Plan 9 operating system, see Suggestions for Further Reading 5.3.5.
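The shape of that solution, sketched in C with invented names for the interrupt-control operations:

extern void disable_interrupts(void);
extern void enable_interrupts(void);
extern void acquire(int *lock);
extern void release(int *lock);
extern int  thread_table_lock;

/* Disable interrupts over a region larger than the one protected by
   thread_table_lock, so that the interrupt stream and the thread manager
   become separate before-or-after actions on this processor. */
void acquire_thread_table(void)
{
    disable_interrupts();             /* interrupts are delayed, not lost */
    acquire(&thread_table_lock);
}

void release_thread_table(void)
{
    release(&thread_table_lock);
    enable_interrupts();              /* pending interrupts now run safely */
}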
Problem set 9 explores an implementation of a thread package with preemptive scheduling for a trivial operating system tailored to a single processor, which allows for other solutions to coordinating interrupts.
Preemptive scheduling is the mechanism that enforces modularity among threads because it isolates threads from one another's behavior, guaranteeing that no thread can halt the progress of other threads. The programmer can thus write a module as a standard computer program, execute it with its own thread, and not have to worry about any other modules in the system. Even though several programs are sharing the processors, programmers can consider each module independently and can think of each module as having a processor to itself. Furthermore, if a programming error causes a module to enter into an endless loop, another module that interacts with the user gets a chance to run at some point, allowing the user to destroy the ill-behaved thread by calling the DESTROY_THREAD procedure.
Preemptive scheduling enforces modularity in the sense that one thread cannot stop the progress of another thread, but if all threads share a single address space, then they can modify each other’s memory accidentally. That may be okay for threads that are working together on a common problem, but unrelated threads need to be protected from erroneous or malicious stores of one another. We can provide that protection by making the thread manager aware of the virtual address spaces of Section 5.4.
This awareness can be implemented by having the thread manager, when it switches a processor from one thread to another, also switch the address space. That is, ENTER_PROCESSOR_LAYER saves the contents of the processor’s PMAR in the thread_table entry of the thread that is releasing the processor, and EXIT_PROCESSOR_LAYER loads the processor’s PMAR with the value in the thread_table entry of the new thread.
Loading the PMAR adds one significant complication to the thread manager: starting at the instant that the processor loads a new value into the PMAR, the processor will translate virtual addresses using the new page table, so that it will take its next instruction from some location in the new virtual address space. As mentioned in Section 5.4.3.2, one way to deal with this complication is to map both the instructions and the data of the thread manager into the same set of virtual addresses in every virtual address space. Another possibility is to design hardware that can load the PMAR, SP, and PC as a single before-or-after action, thereby returning control to the thread in the new virtual address space at the saved location and with the saved stack pointer.
Figure 5.23 and the program fragments in Figures 5.24 and 5.25 showed how to create several threads in the thread layer from one thread in the processor layer. In particular, Figure 5.25 explained how a processor thread in the processor layer can be used to dynamically create and delete threads in the thread layer. Many systems generalize this implementation to support interrupt handling and multiple layers of thread management, as shown in Figure 5.27.
Figure 5.27 Thread managers applied recursively.
To support interrupts, we can think of a processor as a hard-wired thread manager with two threads: (1) a processor thread (e.g., the thread that runs SCHEDULER in Figure 5.25) and (2) an interrupt thread that runs interrupt handlers in kernel mode. On an interrupt, a processor’s hard-wired thread manager switches from a processor thread to an interrupt thread that runs an interrupt handler in kernel mode, which may invoke a thread-layer exception handler that calls YIELD.
The operating system thread layer uses the processor threads of the processor layer to implement a second layer of threads and gives each application module one or more preemptively scheduled virtual processors. When the operating system thread manager switches to another thread, it may also have to load the chosen thread’s page-map address into the page-map address register to switch to the address space of the chosen thread. The operating system thread manager runs in kernel mode.
Each application module in turn may implement, if it desires, its own, user-mode, third-layer thread manager using one or more virtual processors provided by the operating system layer. For example, some Web servers have an embedded Java interpreter to run Java programs, which may use several Java threads. To support threads at the Java level, the Java interpreter has its own thread manager. Typically, a third-layer thread manager uses non-preemptive scheduling because all threads belong to the same application module and don’t have to be protected from each other.
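To make the idea of a user-mode, third-layer thread manager concrete, here is a toy sketch in Python. Each generator plays the role of a user-level thread, next() plays the role of dispatching, and yield plays the role of YIELD; scheduling is non-preemptive, as the paragraph above suggests. All names are illustrative.

from collections import deque

def run(threads):
    # A miniature non-preemptive scheduler: round-robin over runnable threads.
    runnable = deque(threads)
    while runnable:
        thread = runnable.popleft()
        try:
            next(thread)              # run the thread until it yields
            runnable.append(thread)   # still runnable, so reschedule it
        except StopIteration:
            pass                      # the thread exited

def worker(name, steps):
    for i in range(steps):
        print(name, "step", i)
        yield                         # cooperative YIELD

run([worker("A", 3), worker("B", 2)])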
Generalizing, we get the picture in Figure 5.27, where a number of threads at layer n can be used to implement higher-layer threads at layer n + 1. Each hardware processor at the lowest layer creates two threads: a processor thread and an interrupt thread. One layer up, the operating system uses the processor threads to provide one or more threads per module: one thread for the editor, one thread for the window manager, one thread for the keyboard manager, and several threads for the file service. One layer further up, the file service thread creates three application-level threads out of two operating system threads: one to wait for the disk and one for each of two client requests. At each layer, a thread manager switches one or more threads of layer n − 1 among several layer n threads.
Although the layering idea is simple in the abstract, in practice a number of issues must be carefully thought through—for example, if a thread blocks in a layer different than the layer where it was created and where its scheduler runs. Clark [Suggestions for Further Reading 5.3.3] and Anderson et al. [Suggestions for Further Reading 5.3.2] discuss some of the practical issues.
The thread manager described in Section 5.5 allows processors to be shared among many threads. A thread can release its processor so that other threads get a chance to run, as the sender and receiver do using YIELD in Figure 5.22. When the sender or receiver is scheduled again, it retests the shared variables in and out. This mode of interaction, where a thread continually tests a shared variable, is called polling. Polling in software is usually undesirable because every time a thread discovers that the test for a shared variable fails, it has acquired and released its processor needlessly. If a system has many polling threads, then the thread manager spends much time performing unnecessary thread switches instead of running threads that have productive work to perform.
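A short sketch makes the cost of polling visible. The Python fragment below mimics the one-sender, one-receiver bounded buffer of Figure 5.22: each failed test of the shared variables costs a needless pass through the scheduler (time.sleep(0) stands in for YIELD). The names in_count and out_count stand for the book's shared variables in and out.

import time

N = 4
buffer = [None] * N
in_count = 0    # stands for the shared variable "in"
out_count = 0   # stands for the shared variable "out"

def send_polling(msg):
    global in_count
    while in_count - out_count >= N:   # buffer full...
        time.sleep(0)                  # ...so YIELD and poll again later
    buffer[in_count % N] = msg
    in_count += 1

def receive_polling():
    global out_count
    while in_count == out_count:       # buffer empty
        time.sleep(0)                  # YIELD
    msg = buffer[out_count % N]
    out_count += 1
    return msg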
Ideally, a thread manager should schedule a thread only when the thread has useful work to perform. That is, we would prefer a way of waiting that avoids spinning on calls to YIELD. For example, a sender with a bounded buffer should be able to tell the thread manager not to run it until in − out < N (that is, until the buffer has room). One way to approach this goal is for a thread manager to support primitives for sequence coordination, which is what this section explores.
To see what we need for the primitives for sequence coordination, consider an obvious, but incorrect, implementation of sender and receiver, as shown in Figure 5.28. This implementation uses a variable shared between the sender and receiver, and two new, but inadequate, primitives—WAIT and NOTIFY—that take as argument the name of the shared variable:
Figure 5.28 An implementation of a virtual communication link for a system with locks, NOTIFY, and WAIT.
WAIT (event_name) is a before-or-after action that sets this thread’s state to WAITING, places event_name in the thread table entry for this thread, and yields its processor.
NOTIFY (event_name) is a before-or-after action that looks in the thread table for a thread that is in the state WAITING for event_name and changes that thread to the RUNNABLE state.
To support this interface, the thread manager must add the WAITING state to the RUNNING and RUNNABLE states for threads in the thread table. When the scheduler runs (for example, when some thread invokes YIELD), it skips over any thread that is in the WAITING state.
The program in the figure uses these primitives as follows. A thread invokes WAIT to allow the thread manager to release the thread’s processor until a call to NOTIFY (lines 15 and 25). The thread that changes in invokes NOTIFY (line 15) to tell the thread manager to give a processor to a receiver thread waiting on nonempty (line 22), since now there is a message in the buffer (i.e., out < in). There is a similar call to NOTIFY by the thread that updates out (line 25), to tell the thread manager to give a processor to a sending thread waiting on room (line 12), since now there is room to add a message to the buffer. This implementation avoids needless thread switches because the waiting receiver thread receives a processor only if NOTIFY has been called.
Unfortunately, the use of WAIT and NOTIFY introduces a race condition. Let’s assume that the buffer is empty (i.e., in = out) and a receiver and a sender run on separate processors. The following order of statements will result in a lost notification: A20, A21, B9 through B17, and A22:
The receiver finds that the buffer is empty (A20) and releases the lock (A21), but before the receiver executes A22, the sender executes B9 through B17, which adds an item to the buffer and notifies the receiver. The notification is lost because the receiver hasn’t called WAIT yet. Now the receiver executes WAIT (A22) and waits for a notification that will never come. The sender continues adding items to the buffer until the buffer is full and then calls WAIT. Now both the receiver and the sender are waiting.
We could modify the program to call NOTIFY on each invocation of SEND, but that won’t fix the problem. It makes it less likely that the notification will be lost, but it doesn’t eliminate the possibility. The following ordering of statements could happen: the receiver executes A20 and A21, then it is interrupted long enough for the sender to add N items, and then the receiver calls A22. With this ordering, the receiver misses all of the repeated notifications.
Swapping statements 21 and 22 would result in a lost notification, too. The receiver would then call WAIT while still holding buffer_lock. But the sender needs to be able to acquire buffer_lock in order to notify the receiver, so everything would come to a halt.
The problem is that we have three operations on the shared buffer state that must be turned into a single before-or-after action: (1) testing whether there is room in the shared buffer, (2) if not, going to sleep until there is room, and (3) releasing the shared lock so that another thread can make room. If these three operations are not one before-or-after action, the risk of the lost notification problem arises.
The pseudocode that uses WAIT and NOTIFY illustrates a tension between modularity and locks. An observant reader might ask: if the problem is a race condition caused by concurrent threads running multistep actions (e.g., the sender (1) tests for space and (2) calls WAIT at the same time that the receiver (1) makes space and (2) calls NOTIFY), why don’t we just make those steps into before-or-after actions by putting a lock around them? The problem is that the steps we would like to make into an atomic action are, in the sender’s case, (1) comparing in and out and (2) changing the thread table entry from RUNNING to WAITING. But the variables in and out are owned by the sender and receiver modules, whereas the thread table is owned by the thread manager module. These are not only separate modules, but the thread manager is probably in the kernel. So who should own the lock that creates the before-or-after action? We can’t allow the correctness of the kernel to depend on a user program properly setting and releasing a kernel lock, nor can we allow the correctness of the kernel to depend on a user lock being correctly implemented. The real problem here is that the lock needed to create the before-or-after action must protect an invariant that is a relation between a piece of application-owned state and a piece of system-owned state.
Designers have identified various solutions to the problem of creating before-or-after actions to eliminate lost notifications. A general property of all these solutions is that they bring some additional thread state that characterizes the event for which the thread is waiting under protection of the thread table lock (i.e., thread_table_lock). By extending the semantics of WAIT and NOTIFY to include examining and modifying the variable event_name, it is possible to avoid lost notifications. We leave that solution as an exercise to the reader and instead offer simpler and more widely used solutions based on primitives other than WAIT and NOTIFY. Problem set 13 introduces a solution in which the additional thread state is held in what is called a condition variable, and Birrell’s tutorial does a nice job of explaining how to program with threads and condition variables [Suggestions for Further Reading 5.3.1]. Sidebar 5.7 and problem set 12 describe a solution in which the additional thread state is a variable known as a semaphore. In this section we describe a solution (one that is intended to be particularly easy to reason about) in which the additional thread state is found in variables called eventcounts and sequencers [Suggestions for Further Reading 5.5.4]. In all of these solutions, the additional thread state must be shared between the application (e.g., SEND and RECEIVE) and the thread manager, so the semantics of WAIT/NOTIFY, condition variables, semaphores, eventcounts, and other similar solutions all contain non-obvious and sometimes quite subtle aspects. A good discussion of some of these subtle issues is provided by Lampson and Redell [Suggestions for Further Reading 5.5.5].
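As one illustration of these solutions, here is a minimal bounded buffer using condition variables, in Python rather than the book's pseudocode. The essential point is that cv.wait() releases the lock and puts the thread to sleep as a single before-or-after action, which is exactly the step whose absence loses notifications with WAIT and NOTIFY.

import threading

N = 4

class BoundedBuffer:
    def __init__(self):
        self.items = []
        self.cv = threading.Condition()   # a lock plus waiting-thread state

    def send(self, msg):
        with self.cv:                     # ACQUIRE the lock
            while len(self.items) >= N:   # retest after every wakeup
                self.cv.wait()            # release lock and sleep, atomically
            self.items.append(msg)
            self.cv.notify_all()          # wake any waiting receivers

    def receive(self):
        with self.cv:
            while not self.items:
                self.cv.wait()
            msg = self.items.pop(0)
            self.cv.notify_all()          # wake any waiting senders
            return msg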
Sidebar 5.7 Avoiding the Lost Notification Problem with Semaphores
Semaphores are counters with special semantics for sequence coordination. A semaphore supports two operations:
- DOWN (semaphore): if semaphore > 0, decrement semaphore and return; otherwise, wait until another thread increases semaphore and then try to decrement again.
- UP (semaphore): increment semaphore, wake up all threads waiting on semaphore, and return.
Semaphores are inspired by the ones that railroad systems use to coordinate the use of a shared track. If a semaphore is down, trains must stop until the current train on the track leaves the track and raises the semaphore. If a semaphore can take on only the values 0 and 1 (sometimes called a binary semaphore), then UP and DOWN operate much like a railroad semaphore. Semaphores were introduced in computer systems by the Dutch programmer Edsger Dijkstra (see also Sidebar 5.2), who called the DOWN operation P (“pakken”, for grabbing in Dutch) and the UP operation V (“verhogen”, for raising in Dutch) [Suggestions for Further Reading 5.5.1].
The implementation of DOWN and UP must be before-or-after actions to avoid the lost notification problem. This property can be realized in the same way as the eventcount operations:
1 structure semaphore
2 integer count
3
4 procedure UP (semaphore reference sem)
5 ACQUIRE (thread_table_lock)
6 sem.count ← sem.count + 1
7 for i from 0 to 6 do // wake up all threads waiting on this semaphore
8 if thread_table[i].state = WAITING and thread_table[i].sem = sem then
9 thread_table[i].state ← RUNNABLE
10 RELEASE (thread_table_lock)
11 procedure DOWN (semaphore reference sem)
12 ACQUIRE (thread_table_lock)
13 id ← GET_THREAD_ID()
14 thread_table[id].sem ← sem // record the semaphore this thread is waiting on
15 while sem.count < 1 do // give up the processor when sem < 1
16 thread_table[id].state ← WAITING
17 ENTER_PROCESSOR_LAYER (id, CPUID)
18 sem.count ← sem.count − 1
19 RELEASE (thread_table_lock)
Using semaphores, one can implement SEND and RECEIVE with a bounded buffer without lost notifications (see problem set 12).
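For readers who want the shape of that solution, here is a hedged sketch in Python (the details that problem set 12 asks for are left to the reader). The semaphore room counts empty slots, items counts filled slots, and a separate mutex makes each buffer update a before-or-after action.

import threading

N = 4

class SemaphoreBuffer:
    def __init__(self):
        self.buffer = []
        self.room = threading.Semaphore(N)    # DOWN blocks when the buffer is full
        self.items = threading.Semaphore(0)   # DOWN blocks when the buffer is empty
        self.mutex = threading.Lock()

    def send(self, msg):
        self.room.acquire()         # DOWN (room): wait for an empty slot
        with self.mutex:
            self.buffer.append(msg)
        self.items.release()        # UP (items): this notification cannot be lost

    def receive(self):
        self.items.acquire()        # DOWN (items): wait for a message
        with self.mutex:
            msg = self.buffer.pop(0)
        self.room.release()         # UP (room)
        return msg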
Eventcounts and sequencers are variables that are shared among the sender, the receiver, and the thread manager. They are manipulated using the following interface:
AWAIT(eventcount, value) is a before-or-after action that compares eventcount to value. If eventcount exceeds value, AWAIT returns to its caller. If eventcount is less than or equal to value, AWAIT changes the state of the calling thread to WAITING, places value and the name of eventcount in this thread’s entry in the thread table, and yields its processor.
ADVANCE(eventcount) is a before-or-after action that increments eventcount by one and then searches the thread table for threads that are waiting on this eventcount. For each one it finds, if eventcount now exceeds the value for which that thread is waiting, ADVANCE changes that thread’s state to RUNNABLE.
TICKET(sequencer) is a before-or-after action that returns a non-negative value that increases by one on each call. Two threads concurrently calling TICKET on the same sequencer receive different values, and the ordering of the values returned corresponds to the time ordering of the execution of TICKET.
READ(eventcount or sequencer) is a before-or-after action that returns to the caller the current value of the variable. Having an explicit READ procedure ensures before-or-after atomicity for eventcounts and sequencers whose value may grow to be larger than a memory cell.
To implement this interface, the scheduler skips over any thread that is in the WAITING state.
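Before turning to the bounded-buffer examples, it may help to see the interface in executable form. The Python sketch below mimics eventcounts and sequencers in user space; the book implements them inside the thread manager, so here a condition variable stands in for the thread table and scheduler. The method is named await_ only because await is a reserved word in Python.

import threading

class EventCount:
    def __init__(self):
        self.count = 0
        self.cv = threading.Condition()

    def await_(self, value):          # AWAIT (eventcount, value)
        with self.cv:
            while self.count <= value:
                self.cv.wait()

    def advance(self):                # ADVANCE (eventcount)
        with self.cv:
            self.count += 1
            self.cv.notify_all()      # make waiting threads runnable

    def read(self):                   # READ (eventcount)
        with self.cv:
            return self.count

class Sequencer:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def ticket(self):                 # TICKET (sequencer)
        with self.lock:               # a before-or-after action
            t = self.value
            self.value += 1
            return t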
To understand these primitives, consider first the implementation of a bounded buffer for a single sender and receiver. Using eventcounts, we can rewrite the implementation of the bounded buffer from Figure 5.6 as shown in Figure 5.29. SEND waits until there is space in the buffer. Because AWAIT implements the waiting operation, the code in Figure 5.29 does not need the while loop that waits for success in Figure 5.6. Once there is space, the sender adds the message to the buffer and increments in using ADVANCE, which may change the receiver’s state to RUNNABLE. Because AWAIT and ADVANCE operations are before-or-after actions, the lost notification problem cannot occur.
Figure 5.29 An implementation of a virtual communication link for a single sender and receiver using eventcounts.
Again, because AWAIT implements the waiting operation, the receiver implementation is also simple. RECEIVE waits until there is a message in the buffer. Once there is one, the receiver extracts the message from the buffer and increments out using ADVANCE, which may change the sender’s state to RUNNABLE.
Figure 5.30 shows the implementation for the case of multiple senders with a single receiver. To ensure that several senders don’t try to write into the same location within the buffer, we need to coordinate their actions. We can use the TICKET primitive to solve this problem, which requires changes only to SEND. The main difference between Figure 5.30 and Figure 5.29 is that the senders must obtain a ticket to serialize their operations. SEND obtains a ticket from the sequencer sender (line 7). TICKET operates like the “take a number” machine in a bakery or post office. The returned tickets create an ordering of senders and tell each sender its position in the order. Each sender thread then waits until its turn comes up by invoking AWAIT, passing as arguments the eventcount sent and the value of its ticket, t (line 8). When sent reaches the number on the ticket of a sender thread, that sender thread proceeds to the next step, which is to wait until there is space in the buffer (line 9), and only then does it add its item to its entry in buffer. Because TICKET is a before-or-after action, no two threads will get the same number. Again, because AWAIT and ADVANCE operations are before-or-after actions, the lost notification problem cannot happen.
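Using the EventCount and Sequencer sketches from earlier in this section, a multiple-sender SEND can be written as below. This mirrors the structure just described, under two stated assumptions: tickets start at 0, and there is a single receiver. The eventcount sent counts completed SENDs, and received counts completed RECEIVEs.

N = 4

class MultiSenderBuffer:
    def __init__(self):
        self.buffer = [None] * N
        self.sender = Sequencer()
        self.sent = EventCount()      # number of completed SENDs
        self.received = EventCount()  # number of completed RECEIVEs

    def send(self, msg):
        t = self.sender.ticket()      # take a number
        self.sent.await_(t - 1)       # wait until it is my turn (sent = t)
        self.received.await_(t - N)   # wait until slot t mod N has room
        self.buffer[t % N] = msg
        self.sent.advance()           # message number t is now in the buffer

    def receive(self):                # single receiver assumed
        out = self.received.read()
        self.sent.await_(out)         # wait until message number out arrives
        msg = self.buffer[out % N]
        self.received.advance()
        return msg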
Figure 5.30 An implementation of a virtual communication link for several senders using eventcounts.
Again, this solution doesn’t use a while loop that waits for success as in Figure 5.6. With multiple senders, it is slightly tricky to see why this is correct. AWAIT guarantees that eventcount exceeded value at some instant after AWAIT was called, but if there are other concurrent threads that may increment value, then by the time AWAIT’s caller gets control back, eventcount may no longer exceed value. The proper view is that a return from AWAIT is a hint that the condition AWAIT was waiting for was true and may still be true, but the program that called AWAIT must check again to be sure.
The issue seems to arise when there are multiple senders. Suppose the buffer is full (say in and out are 10) and there are two sending threads that are both waiting for a slot to become empty. The first one of those sending threads that runs will absorb the buffer entry and change in to 11. The second sending thread will find that in is 11 but out is also 11, so from its point of view, AWAIT returned with in = out. Yet it doesn’t recheck the condition. Closer inspection of the code reveals that this case can never arise because the second sender is actually waiting its turn on the ticket returned by the sequencer sender, not waiting for in < out. There is never a case in which two senders are waiting for the same condition to become true. If the program had used a different way of coordinating the senders, it might have required a retest of the condition when AWAIT returns. This is another example of why programming with concurrent threads requires great care.
If the implementation must also work with multiple receivers, then a similar sequencer is needed in RECEIVE to allow the receivers to serialize themselves.
With these additional primitives for sequence coordination, we can describe the life of a thread as a state machine with four states (see Figure 5.31). The thread manager creates a thread in the RUNNABLE state. The thread manager schedules one of the runnable threads and dispatches a processor to it; that thread changes to the RUNNING state. By calling YIELD, the thread reenters the RUNNABLE state, and the manager can select another thread and dispatch to it. Alternatively, a thread can change from the RUNNING state to the NOT_ALLOCATED state by calling EXIT_THREAD. Or a running thread can enter the WAITING state by calling AWAIT when it cannot proceed until some event occurs. Another thread, by calling ADVANCE, can make the waiting thread enter the RUNNABLE state again.
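The four-state life cycle can be written down directly as a transition table; the Python sketch below is just a compact restatement of Figure 5.31, with informal event names (the DESTROY_THREAD flag is omitted).

TRANSITIONS = {
    ("RUNNABLE", "dispatch"): "RUNNING",
    ("RUNNING", "YIELD"): "RUNNABLE",
    ("RUNNING", "AWAIT"): "WAITING",
    ("WAITING", "ADVANCE"): "RUNNABLE",
    ("RUNNING", "EXIT_THREAD"): "NOT_ALLOCATED",
}

def next_state(state, event):
    return TRANSITIONS[(state, event)]

assert next_state("RUNNING", "AWAIT") == "WAITING"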
Figure 5.31 Thread state diagram. In any of the three states RUNNABLE, WAITING, or RUNNING, a call to DESTROY_THREAD sets a flag that causes the scheduler to force the state to NOT_ALLOCATED the next time that thread would have entered the RUNNING state.
These primitives provide new opportunities for a programmer to create deadlocks. For example, thread A may call AWAIT on an eventcount that it expects thread B to ADVANCE, but thread B may be AWAITing an eventcount that only thread A is in a position to ADVANCE. Eventcounts and tickets can eliminate lost notifications, but the primitives that manipulate them must still be used with care. The last few questions of problem set 11 explore the problem of lost notifications by comparing a simple Web service implemented using NOTIFY and ADVANCE.
To implement AWAIT, ADVANCE, TICKET, and READ, we extend the thread manager as follows. YIELD doesn’t require any modifications to support AWAIT and ADVANCE, but we must extend the thread_table to record, for each thread in the WAITING state, a reference to the eventcount on which it is waiting:
shared structure thread_table[7]
integer topstack // value of the stack pointer
integer state // WAITING, RUNNABLE, TERMINATE, NOT_ALLOCATED
eventcount reference event // if waiting, the eventcount we are waiting on
long integer value // if waiting, what value are we waiting for
shared lock instance thread_table_lock // lock to protect entries of thread_table
The field event is a reference to an eventcount so that the thread manager and the calling thread can share it. This sharing is the key to resolving the tension mentioned earlier: it allows a variable of the calling thread to be protected by the thread manager’s lock.
We implement AWAIT by testing the eventcount, setting the state to WAITING if the test fails, and calling ENTER_PROCESSOR_LAYER to switch to the processor thread:
1 structure eventcount
2 long integer count
3 procedure AWAIT (eventcount reference event, value)
4 ACQUIRE (thread_table_lock)
5 id ← GET_THREAD_ID ()
6 thread_table[id].event ← event
7 thread_table[id].value ← value
8 if event.count ≤ value then thread_table[id].state ← WAITING
9 ENTER_PROCESSOR_LAYER (id, CPUID)
10 RELEASE (thread_table_lock)
This implementation of AWAIT tests whether eventcount event exceeds value and, if not, releases its processor, all as one before-or-after action. As before, the thread data structures are protected by the lock thread_table_lock. In particular, the lock ensures that the line 8 comparison of event with value and the potential change of state from RUNNING to WAITING are two steps of a before-or-after action that must occur either completely before or completely after any concurrent call to ADVANCE that might change the value of event or the state of this thread. The lock thus prevents lost notifications.
ENTER_PROCESSOR_LAYER in AWAIT causes control to switch from this thread to the processor thread, which may give the processor away. The thread that calls ENTER_PROCESSOR_LAYER passes the lock it acquired to the processor thread, which passes it to the next thread to run on this processor. Thus, no other thread can modify the thread state while the thread that invoked AWAIT holds thread_table_lock. A return from that call to ENTER_PROCESSOR_LAYER means that some other thread called AWAIT or YIELD and the processor thread has decided it is appropriate to assign a processor to this thread again. The thread will return to line 10, release thread_table_lock, and return to the caller of AWAIT.
The ADVANCE procedure increments the eventcount event and then finds all threads that are waiting on this eventcount with a value less than the new count, changing their state to RUNNABLE:
1 procedure ADVANCE (eventcount reference event)
2 ACQUIRE (thread_table_lock)
3 event.count ← event.count + 1
4 for i from 0 until 7 do
5 if thread_table[i].state = WAITING and thread_table[i].event = event and
6 event.count > thread_table[i].value then
7 thread_table[i].state ← RUNNABLE
8 RELEASE (thread_table_lock)
The key to the implementation of ADVANCE is that it uses thread_table_lock to make ADVANCE a before-or-after action. In particular, the line 6 comparison of event.count with thread_table[i].value and the line 7 change of state to RUNNABLE of the thread that called AWAIT are now two steps of a before-or-after action. No thread calling AWAIT can interfere with a thread that is in ADVANCE. Similarly, no thread calling ADVANCE can interfere with a thread that is in AWAIT. This setup avoids races between AWAIT and ADVANCE, and thus the lost notification problem.
ADVANCE just makes a thread runnable; it doesn’t call ENTER_PROCESSOR_LAYER to release its processor. The runnable thread won’t run until some other thread (perhaps the caller of ADVANCE) calls YIELD or AWAIT, or until the scheduler preemptively releases a processor.
We implement a sequencer and the TICKET operation as follows:
1 structure sequencer
2 long integer ticket
3 procedure TICKET (sequencer reference s)
4 ACQUIRE (thread_table_lock)
5 t ← s.ticket
6 s.ticket ← t + 1
7 RELEASE (thread_table_lock)
8 return t
For completeness, the implementation of READ of an eventcount is as follows:
1 procedure READ (eventcount reference event)
2 ACQUIRE (thread_table_lock)
3 e ← event.count
4 RELEASE (thread_table_lock)
5 return e
To ensure that READ provides before-or-after atomicity, READ is implemented as a before-or-after action using locks. The implementation of READ of a sequencer is similar.
Recall that in Figure 5.8, ACQUIRE itself is implemented with a spin loop, polling the lock continuously instead of releasing the processor. Given that ACQUIRE and RELEASE are used to protect only short sequences of instructions, a spinning implementation is acceptable. Furthermore, inside the thread manager we must use a spinning lock because if ACQUIRE (thread_table_lock) were to call AWAIT to wait until the lock is unlocked, then the thread manager would be calling itself, but it isn’t designed to be recursive. In particular, it does not have a base case that could stop recursion.
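A spinning ACQUIRE can be sketched in Python with the non-blocking form of acquire; this is illustrative only, since a real spin lock is built on an atomic read-modify-write instruction.

import threading

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()   # stands in for the lock variable

    def acquire(self):
        # Poll the lock continuously instead of releasing the processor.
        while not self._flag.acquire(blocking=False):
            pass                        # spin

    def release(self):
        self._flag.release()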
Some threads must interact with external devices. For example, the keyboard manager must be able to interact with the keyboard controller on the keyboard, which is a separate, special-purpose processor. As we shall see, this interaction is just another example of sequence coordination.
The keyboard controller is a special-purpose processor, which runs a single program that gathers key strokes. In the terminology of this chapter, we can think of the keyboard controller as a single, hard-wired thread running with its own dedicated processor. When the user presses a key, the keyboard controller thread raises a signal line long enough to set a flip-flop, a digital circuit that can store one bit that the keyboard manager can read. The controller thread then lowers the signal line until next time (i.e., until the next keystroke). The flip-flop shared between the controller and the manager allows them to coordinate their activities.
In fact, using the shared flip-flop, the keyboard manager can run a wait-for-input loop similar to the one in the receiver:
1 while FLIP_FLOP = 0 do
2 YIELD ()
In this case, the keyboard controller sets the flip-flop, and the keyboard manager reads the flip-flop and tests it. If the flip-flop is not set, the manager reads 0 and yields. If it is set, the manager falls out of the loop. As a side effect of reading the flip-flop, it is reset to 0, thus providing a kind of coordination lock.
Here we have another example of polling. In polling, a thread keeps checking whether another (perhaps hardware) thread needs attention. In our example, the keyboard manager runs every time the scheduler offers it a chance, to see if any new keys have been pressed. The keyboard manager thread is continually in a RUNNABLE state, and whenever the scheduler selects it to run, the thread checks the flip-flop.
Polling has several disadvantages, especially if it is done by a program. If it is difficult to predict the time until the event will occur, then there is no good choice for how often a thread should poll. If the polling thread executes infrequently (e.g., because the processors are busy executing other threads), then it might take a long time before a device receives attention. In this case, the computer system might appear to be unresponsive; for example, if a user must wait a long time before the computer processes the user’s keyboard input, the user has a bad interactive experience. On the other hand, if the scheduler selects the polling thread frequently (e.g., faster than users can type), the thread wastes processor cycles, since often there will be no input available. Finally, some devices might require that a processor executes their managers by a certain deadline because otherwise the device won’t operate correctly. For example, the keyboard controller may have only a single keystroke register available to communicate with the keyboard manager. If the user types a second keystroke before the keyboard manager gets a chance to run and absorb the first one, the first keystroke may be lost.
These disadvantages are similar to the disadvantages of not having explicit primitives for sequence coordination. Without AWAIT and ADVANCE, the thread scheduler doesn’t know when the receiver thread must run; therefore, the receiver thread may make unnecessary, repeated calls to YIELD. This situation with the keyboard manager is similar; ideally, when the controller has input that needs to be processed, it should be able to alert the scheduler that the keyboard manager thread should run. We would like to program the keyboard manager and keyboard controller as a sender and a receiver using the primitives for sequence coordination, much as in Figure 5.30, except we could use a solution that works for a single sender and a single receiver. Unfortunately, the controller cannot invoke procedures such as AWAIT and ADVANCE directly; it shares only a single flip-flop with the processors.
The trick is to move the polling loop down into the hardware by using interrupts. The keyboard manager enables interrupts by setting a processor’s interrupt control register to ON, indicating to that processor that it must take interrupts from the keyboard controller. Then, to check for an interrupt, the processor polls the shared flip-flop at the beginning of every instruction cycle. When the processor finds that the shared flip-flop has changed, instead of proceeding to the next instruction, the processor executes the interrupt handler. In other words, interrupts are actually implemented as a polling loop inside a processor. A processor may support multiple interrupts by providing multiple shared flip-flops and a map that associates a different interrupt handler with each flip-flop.
A simple interrupt handler for the keyboard device calls ADVANCE, the call that the keyboard controller is unable to make directly, and then returns. The interrupted thread continues operation without realizing that anything happened. But the next time any thread calls YIELD or AWAIT, the thread manager can notice that the keyboard manager thread has become runnable. When it runs, the keyboard manager can then copy the keystrokes from the device, translate them to a character representation, put them in a shared buffer (e.g., for the receiver), and wait for the next keystroke.
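A user-space analogy (POSIX only) may help: a Unix signal handler plays the role of the interrupt handler. It runs between two "instructions" of whatever code happens to be executing, does a small, quick step, and returns; the interrupted code continues without realizing anything happened. SIGALRM again stands in for the device interrupt, and all names are illustrative.

import signal
import time

keystrokes = []   # buffer that the "interrupt handler" fills

def keyboard_interrupt_handler(signum, frame):
    # Keep handlers short: move the data and return (the analog of copying
    # the keystroke out of the device and making the thread runnable).
    keystrokes.append("x")

signal.signal(signal.SIGALRM, keyboard_interrupt_handler)
signal.alarm(1)           # one simulated keystroke interrupt, a second from now

while not keystrokes:     # the "keyboard manager" eventually notices
    time.sleep(0.1)
print("got keystroke:", keystrokes.pop(0))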
Because the interrupt handler gains control of a processor within one instruction time, it can be used to meet deadlines. For example, the interrupt handler for the keyboard device could copy the user’s keystrokes to a buffer owned by the keyboard manager immediately, instead of waiting until the keyboard manager gets a chance to run. This way the keyboard device is immediately ready for the user’s next keystroke. To meet such deadlines, interrupt handlers are usually more elaborate than a single call to ADVANCE. It is common to place modest-sized chunks of code in an interrupt handler to move data out of the device buffers (e.g., keystrokes out of the keyboard device) or immediately restart an I/O device that has turned itself off.
Putting code in an interrupt handler must be done with great care. An interrupt handler must be cautious in reading or writing shared variables because it may be invoked between any pair of instructions. Therefore, the handler cannot be sure of the state of the thread currently running on the processor or on other processors.
Since interrupt handlers are not threads managed by the operating system thread manager, the interrupt handlers and the operating system thread manager must be carefully programmed. For example, the thread manager should call ACQUIRE and RELEASE on the thread_table_lock with interrupts disabled because otherwise a deadlock might occur, as we saw in Section 5.5.4. As another example, an interrupt handler should never call AWAIT because AWAIT may release its processor to the surprise of the interrupted thread—the interrupted thread may be a thread that has nothing to do with the interrupt but just happened to be running on the processor when the interrupt occurred. On the other hand, an interrupt handler can invoke ADVANCE without causing any problems.
The restrictions on exception handlers that process errors caused by the currently running thread (e.g., a divide-by-zero error) are less severe because the handler runs on behalf of the thread currently running on the processor. So, in that case, the handler can call YIELD or AWAIT.
The previous sections introduced the main ideas for enforcing modularity within a computer using a simple processor. This section presents a case study of how the popular Intel x86 processor provides support for enforced modularity and how commonly used operating systems use this support. The next section provides a case study of enforcing modularity at the processor level using virtual machines.
The Intel x86 processor architecture is currently the most widely used architecture for microprocessors of personal computers, laptops, and servers. The x86 architecture started without any support for enforced modularity. As the robustness of software on personal computers, laptops, and servers has become important, the Intel designers have added support for enforcing modularity. The Intel designers didn’t get it right on the first try. The evolution of x86 architecture to include enforced modularity provides some good examples of the rapid improvement in technology and challenges of designing complex systems, including market pressure.
In 1971 Intel produced its first microprocessor, the 4004, intended for calculators and implemented in 2,250 transistors. The 4004 is a 4-bit processor (i.e., the word size is 4 bits and the processor computes with 4-bit wide operands) and can address as much as 4 kilobytes of program memory and 640 bytes of data memory. The 4004 provides a stack that can store only three stack frames, no interrupts, and no support for enforcing modularity. Hardware support for the missing features was well known in 1971, but there was little need for it in a calculator.
The follow-on processor, the 8080 (1974), was Intel’s first microprocessor that was used in a personal computer, namely, the Altair, produced by MITS. Unlike the 4004, the 8080 is a general-purpose microprocessor. The 8080 has 5,000 transistors: an 8-bit processor that can address up to 64 kilobytes of memory (16-bit addresses), without support for enforcing modularity. Bill Gates and Paul Allen of Microsoft fame developed a program that could run BASIC applications on the Altair. Since the Altair couldn’t run more than a single, simple program at one time, there was still no need for enforcing modularity.
The 8080 was followed by the 8086 in 1978, with 29,000 transistors. The 8086 is a 16-bit processor but with 20-bit bus addresses, allowing access to 1 megabyte of memory. To make a 20-bit address out of a 16-bit address register, the 8086 has four 16-bit wide segment descriptors. The 8086 combines the value in the segment descriptors and the 16-bit address in an operand as follows: (16-bit segment descriptor × 16) + 16-bit address, producing a 20-bit value. The segment descriptor can be viewed as a memory address to which the 16-bit address in the operand field of the instruction is added.
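The calculation is easy to check with a hypothetical helper in Python:

def physical_address(segment, offset):
    # (16-bit segment descriptor × 16) + 16-bit address, a 20-bit result
    return ((segment << 4) + offset) & 0xFFFFF

# Example: segment 0x1234 and offset 0x0010 name physical byte 0x12350.
assert physical_address(0x1234, 0x0010) == 0x12350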
The primary purpose of these segments is to extend physical memory, as opposed to providing enforced modularity. Using the four segment descriptors, a program can refer to a total of 256 kilobytes of memory at one time. If a program needs to address other memory, the programmer must save one of the segment descriptors and load it with a new value. Thus, writing programs for the 8086 that use more than 256 kilobytes of memory is inconvenient because the programmer must keep track of segment descriptors and where segment data is located.
Although the 8086 has a different instruction repertoire from the 8080, programs for the 8080 could run on the 8086 unmodified using a translator provided by Intel. As we will see, backwards compatibility is a recurring theme in the evolution of the Intel processor architecture and one key to Intel’s success.
The 8088 (1979) was put together hastily in response to IBM’s request for a processor for its personal computer. The 8088 is identical to the 8086, except that it has an 8-bit data bus, which made the processor less expensive. Most devices at that time had an 8-bit interface anyway. Microsoft supplied the operating system, named Microsoft Disk Operating System (MS-DOS), for the IBM PC. Microsoft first licensed the operating system from Seattle Computer Products and then acquired it shortly before the release of the PC for $50,000. The IBM PC was a commercial success and started the rise of Intel and Microsoft.
The IBM PC reserved the first 640 kilobytes of the 1-megabyte physical address space for programs and the top 360 kilobytes for input and output. The designers assumed that no programs on a personal computer needed more than 640 kilobytes of memory. To keep the price and complexity down, neither 8088 nor MS-DOS had any support for enforcing modularity.
Because the IBM PC was inexpensive, it became widely used; more and more new software was developed for it, and the existing software became richer in features. In addition, users wanted to run several programs at the same time; that is, they wanted to easily switch from one program to another without having to exit a program and start it again later. These developments posed three new design goals for Intel and Microsoft: larger address spaces to run more complex programs, running several programs at once, and enforcing modularity between them. Unfortunately, the last goal conflicts with backwards compatibility because existing programs took full advantage of having direct access to physical memory.
Intel’s first attempt to achieve some of these goals was the 80286* (1982), a 16-bit processor that can address up to 16 megabytes of memory (24-bit physical addresses) and has 134,000 transistors. The 80286 has two modes, named real and protected: in real mode old 8086 programs can run; in protected mode new programs can take advantage of enforced modularity through a change in the interpretation of segment descriptors. In protected mode the segment descriptors don’t define the base address of a segment (as in real mode); rather, each selects an entry out of a table of segment descriptors. This application of the design principle decouple modules with indirection allows a protected-mode program to refer to 2¹⁴ segments. Furthermore, the low 2 bits of a segment selector are reserved for permission bits; 2 bits support four protection levels so that operating system designers can exploit several protection rings†. In practice, protection rings are of limited usefulness, and operating system designers use only two rings (user and kernel) to ensure, for example, that user-level programs cannot access kernel-only segments.
Although Intel sold 15 million 80286s, it achieved the three goals only partially. First, 24 bits was small compared to the 32 bits of address space offered by competing processors. Second, although it is easy to go from real to protected mode, there was no easy way (other than exploiting an unrelated feature in the design of the processor) to switch from protected mode back to real. This restriction meant that an operating system could not easily switch between old and new programs. Third, it took years after the introduction of the 80286 to develop an operating system, OS/2, that could take advantage of the segmentation provided by the 80286. OS/2 was jointly created by Microsoft and IBM, for the purpose of taking advantage of all the protected-mode features of the 80286. But when Microsoft grew concerned about the project, it disowned OS/2, gave it to IBM, and focused instead on Windows 2.0. Most buyers didn’t wait for IBM and Microsoft to get their operating system acts together and instead simply treated the 80286-based PC as a faster 8086 PC that could use more memory.
Overlapping with the 80286, Intel invested over 100 person-years in the design of a full-featured segment-based processor architecture known as the i432. This processor was a ground-up design to enforce modularity and support object-oriented programming. The segment-based architecture included direct support for capabilities, a protection technique for access control (see Chapter 11 [on-line]). The resulting implementation was so complex that it didn’t fit on a single chip and it ran slower than the 80286. It was eventually abandoned, not because it enforced modularity, but because it was overly complex, slow, and lacked backward compatibility with the x86 processor architectures.
Under market pressure from Motorola, which was selling a 32-bit processor with support for page-based virtual memory, Intel scratched the i432 and followed the 80286 with the 80386 in 1985. The 80386 has 270,000 transistors and addresses the main shortcomings of the 80286 while still being backwards compatible with it. The 80386 is a 32-bit processor, which can refer to up to 4 gigabytes of memory (32-bit addresses) and supports 32-bit external data and address buses. Compared with the two modes of the 80286, real and protected, the 80386 provides an additional mode, called virtual real mode, which allows several real-mode programs to run at the same time in virtual environments fully protected from one another. The 80386 design also allows a single segment to grow to 2^32 bytes, the maximum size of physical memory. Within a segment, the 80386 designers added support for virtual memory using paging, with a separate page table for each segment. Operating system designers can choose to use virtual memory with segments, or pages, or both.
This design allows several old programs to run in virtual real mode, each in its own paged address space. This design also allows old programs to have access to more memory than on the 80286, without being forced to use multiple segments. Furthermore, because the 80386 segmentation was backwards compatible with the 80286, 80286 programs and the successor of Windows 2.0 (Windows 3.0) could use the larger segments without any modification. For these reasons, the 80386 was an immediate hit, but it took a while until 32-bit operating systems were available. GNU/Linux, a widely used open-source UNIX-based system, came out in 1991, and Microsoft’s Windows 3.1 and IBM’s OS/2 2.0 followed in 1992. All of these systems incorporated the enforced modularity ideas pioneered in the time-sharing systems of the 1960s and 1970s.
After 1985, the Intel processor architecture was extended with new instructions, but the core instruction repertoire remained the same. The main changes occurred under the hood. Intel and other companies figured out how to implement processors that provide the complex x86 instruction repertoire (some instructions are 1 byte, while others can be up to 17 bytes long, which is why the literature calls the x86 a Complex Instruction Set Computer, or CISC) yet still run as fast as processor architectures designed from scratch with a RISC instruction repertoire. This effort has paid off in terms of performance, but it has required a large number of transistors.
Figure 5.32 shows the growth of Intel processors in terms of transistors over the period 1970–2008*. The y-axis is on a logarithmic scale, and the straight line suggests that the growth has been approximately exponential. The Pentium was originally designated the 80586, but Intel redesignated the 80586 the “Pentium” in order to secure a trademark. This growth is a nice example of d(technology)/dt in action (see Sidebar 1.6).
Figure 5.32 Growth of the number of transistors in Intel processor chips. The label on each point is the commercial name of the chip. (Log scale on y-axis).
The growth in software is also large. Figure 5.33 shows the growth of the Linux kernel in terms of lines of code during the period 1991–2008†. In this graph, the y-axis is on a linear scale. As can be seen, the growth in terms of lines of code has been large, and what is shown is just the kernel. A large contributor to this growth is device drivers for new hardware devices.
Figure 5.33 Growth of the number of lines of code in the Linux kernel. The label on each point is the Linux release number. (Linear scale on y-axis).
The success of the x86 illustrates the importance of a specific instance of the unyielding foundations rule: provide backwards compatibility. If one must change an interface, keep the old interface around or simulate the old version of the interface using the new version of the interface, so that clients keep working without modifications. It is typically much less work to develop a simulation layer that provides backwards compatibility than to reimplement all of the clients from scratch.
For processors, backwards compatibility is particularly important because legacy software is a big factor in the success of a processor architecture. The reason is that legacy software is expensive to modify—the original programmers usually have departed (or forgotten about it) and have not documented it well. Experience shows that even minor modifications risk violating some undocumented assumptions, so it is necessary for someone to understand the old program completely, which takes almost as much effort as writing a completely new one. So customers will nearly always choose the architecture that allows them to continue to run legacy software unchanged. Because the x86 architecture provided backwards compatibility, it was able to survive the competition from RISC processors.
Today we see the legacy software scenario being played out in the change from 32-bit virtual addresses to 64-bit virtual addresses. Intel’s Itanium architecture is gradually disappearing beneath the waves because it is not backwards compatible, while competitor Advanced Micro Devices (AMD)’s 64-bit Athlon is backwards compatible with the billion or so x86 processors currently in the field. At the time of writing, Intel is abandoning the Itanium architecture and following AMD.
Backwards compatibility can also backfire. For example, Xerox decided that creating a PC clone looked more promising than commercializing the workstation developed in its research lab, a machine that had a mouse, a window manager, and a WYSIWYG editor [Suggestions for Further Reading 1.3.3]. Steve Jobs saw the prototype and developed an equivalent, the Apple Macintosh. The benefits of the Macintosh were so great compared to PCs that customers were willing to buy it. (The later evolution of the Macintosh is a different, less successful story.)
This chapter has introduced several high-level abstractions to virtualize processors, memory, and links to enforce modularity. Applications interact with these abstractions through a supervisor-call interface, and interrupt and exception handlers. Another approach uses virtual machines. In this approach, a real physical machine is used as much as possible to implement many virtual instances of itself (including its privileged instructions, such as loading and storing to the page-map address register). That is, virtual machines emulate many instances of a machine A using a real machine A. The software that implements the virtual machines is known as a virtual machine monitor. This section discusses virtual machines and virtual machine monitors in more detail.
A virtual machine is useful in a number of situations:
To run several guest operating systems side by side. For example, on one virtual Intel x86 machine, one can run the GNU/Linux operating system, and on another one can run the Windows/XP operating system. If the virtual machine monitor implements the Intel x86 faithfully (i.e., instructions, state, protection levels, page tables), then one can run GNU/Linux, Windows/XP, and their applications on top of the monitor without modifications.
To contain errors in a guest operating system. Because the guest runs inside a virtual machine, errors in the guest operating system cannot affect the operating system software on other virtual machines. This feature is handy for debugging a new operating system or for containing an operating system that is flaky but important for certain applications.
To simplify development of operating systems. The virtual machine monitor can virtualize the physical hardware to provide a simpler interface, which may simplify the development of an operating system. For example, the virtual machine monitor may turn a multiprocessor computer into a few uniprocessor computers to allow the guest operating system to be written for a uniprocessor, which simplifies coordination.
Virtual machine monitors can be implemented in two ways. First, one can run the monitor directly on hardware in kernel mode, with the guest operating systems in user mode. Second, one can run the monitor as an application in user mode on top of a host operating system. The latter may be less complex to implement because the monitor can take advantage of the abstractions provided by the host operating system, but it is possible only if the host operating system forwards all the events that the monitor needs to perform its job. For simplicity, we assume the first approach (see Figure 5.34); the issues are the same in either case.
Figure 5.34 A virtual machine monitor providing two virtual machines, each running a different guest operating system with its own applications.
To implement virtual machines, the virtual machine monitor must provide three primary functions:
1. Virtualizing the computer. For example, if a guest operating system stores a new value into the page-map address register, then the monitor must make the guest operating system believe that it can do so, even though the guest is running in user mode.
2. Dispatching events. For example, the monitor must forward interrupts, exceptions, and supervisor calls invoked by the applications to the appropriate guest operating systems.
3. Allocating resources. For example, the monitor must divide physical memory among the guest operating systems.
Virtualizing the computer is easy if all instructions are virtualizable. That is, all the instructions that allow a guest to tell the difference between running on the physical machine and running on a virtual machine must result in an exception to the monitor so that the monitor can emulate the intended behavior. In addition, the exception must leave enough information for the exception handler to emulate the instruction and restart the guest operating system as if it had executed the instruction.
Consider instructions that load the page-map address register. These instructions behave differently in user mode and kernel mode. In user mode, these instructions result in an illegal-instruction exception (because they are privileged), whereas in kernel mode the hardware performs them. If a guest operating system invokes such an instruction, for example, to switch to another application on the guest, the monitor must emulate that instruction faithfully so that the application will run with the right page map. Thus, the requirements for such an instruction are that it results in an exception so that the monitor receives control, that it leaves enough information behind that the monitor can emulate it, and that the monitor can restart the guest as if the guest had executed the instruction. That is, the guest should not be able to tell that the monitor emulated the instruction.
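The control flow of this emulation can be sketched in a few lines. The fragment below is written as Python pseudocode, and every helper name in it (decode, resume, and so on) is hypothetical rather than part of any real monitor:

    def illegal_instruction_handler(guest, saved_state):
        # The hardware delivered an illegal-instruction exception while the
        # guest was running in user mode; inspect what the guest attempted.
        instruction = decode(guest, saved_state.pc)   # hypothetical decoder
        if instruction.op == "STORE_TO_PAGE_MAP_ADDRESS_REGISTER":
            # Emulate the privileged store on the guest's virtual state.
            guest.page_map_address = instruction.argument
            install_shadow_page_map(guest)            # see the later sketch
            saved_state.pc += instruction.length      # step past the instruction
            resume(guest, saved_state)                # as if the guest executed it
        else:
            # An instruction the monitor does not emulate: reflect the
            # exception to the guest's own exception handler.
            deliver_exception_to_guest(guest, saved_state)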
If an instruction behaves differently in kernel mode than in user mode and doesn’t result in an exception, then the instruction is called non-virtualizable. For example, on the Intel x86 processor, enabling interrupts is done by setting the interrupt-enable bit in a register called EFLAGS. This instruction behaves differently in user mode and in kernel mode. In user mode, the instruction does not have any effect (i.e., the processor just ignores it), but in kernel mode, the instruction sets the bit in the EFLAGS register and allows interrupts. If a guest operating system invokes this instruction in user space, it will do nothing, but the guest operating system assumes that it is running in kernel mode and that the instruction will enable interrupts. This instruction is an example of a non-virtualizable instruction, and handling instructions like these requires a more sophisticated plan, which is beyond the scope of this text. The paper by Adams and Agesen explains it well [Suggestions for Further Reading 5.6.4].
Allocating resources well among the guest operating systems is more challenging than the usual scheduling problem. For example, the monitor must guess which blocks of physical memory are not in use so that it can use those blocks for other guests; the monitor cannot directly inspect the guest’s list of free memory blocks. The paper by Waldspurger introduces a nice trick for addressing this problem [Suggestions for Further Reading 5.6.3]. As another example, the monitor must guess when a guest operating system has no work to do; the monitor cannot directly observe that the guest is in its idle loop. The literature on virtual machines contains schemes to address these challenges.
To make concrete what the implementation challenges of these functions are, consider a guest operating system that implements its own page tables, mapping virtual addresses to physical addresses. Let’s assume that this guest operating system runs on the processor developed in this text. The goal of the virtual machine monitor is to run several guest operating systems by virtualizing the example processor used in this book (see Section 2.1.2), extended with the instructions documented in this chapter.
To allow each guest operating system to address all physical memory, but not other guests’ physical memory, the virtual machine monitor must guard the guests’ physical addresses. One way to do so is to virtualize addresses recursively. That is, the guest translates application virtual addresses to virtual machine addresses, and the monitor translates virtual machine addresses to physical addresses. One challenge in designing the monitor is to maintain this mapping from application virtual to virtual machine to physical addresses. The general plan is for the monitor to emulate loads and stores to the page-map address register, and to keep its own translation map per virtual machine, which we will refer to as the machine map.
The monitor can deduce which virtual machine memory a guest is using and the mappings from virtual to machine addresses when the guest invokes a store instruction to the page-map address register. Because this instruction is privileged, the processor will generate an illegal-instruction exception and transfer control to the monitor. The argument to the store instruction contains the machine address of a page map. The monitor can read that memory and see which virtual machine memory the guest is planning to use and what the guest’s mappings from virtual to machine are (including the permissions).
For each machine page (including the one that holds the guest page map), the monitor can allocate a physical page and record in the machine map the translation from virtual to machine to physical address, together with its permissions. Equipped with this information, the monitor can construct a new page map that maps the guest’s virtual addresses to physical addresses and install that new map in the real page-map address register (which will succeed because the monitor is running in kernel mode). Thus, although there are two layers of page maps (virtual to machine and machine to physical), the translation performed by the physical processor is only one level: it translates application virtual addresses directly to physical addresses, using the new page map set up by the monitor. To support this double translation plan efficiently, Intel and AMD have added hardware support.
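Continuing the hypothetical Python sketch from above, the monitor might build and install the one-level map as follows; read_guest_page_map, allocate_physical_page, and load_page_map_address_register are stand-ins for monitor internals, not real interfaces:

    def install_shadow_page_map(guest):
        # Read the guest's page map (virtual -> machine) out of machine memory.
        guest_map = read_guest_page_map(guest, guest.page_map_address)
        shadow = {}
        for virtual_page, (machine_page, permissions) in guest_map.items():
            # Allocate a physical page for each machine page on first use,
            # recording the assignment in this guest's machine map.
            if machine_page not in guest.machine_map:
                guest.machine_map[machine_page] = allocate_physical_page()
            shadow[virtual_page] = (guest.machine_map[machine_page], permissions)
        # Install one level of translation: virtual directly to physical.
        # The store succeeds because the monitor runs in kernel mode.
        load_page_map_address_register(shadow)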
As the final step, the monitor can resume the guest operating system at the instruction after the store to the page-map address register, providing the illusion to the guest that it updated the page-map address register directly. Now the guest and the applications can continue execution.
If the guest changes its page map (e.g., it switches to one of its other applications), the monitor will learn about this event because the store to the page-map address register will result in an exception (because the instruction is privileged) and invoke an exception handler in the monitor. The exception handler emulates this instruction by updating the physical page-map address register as above and resumes the guest.
If the monitor wants to switch to another guest OS, it can just switch the page-map address register to the new guest’s page map, like a switch between applications.
If the application addresses a page that is not part of its address space, the hardware will generate a missing-page exception, which will invoke an exception handler in the monitor. Then, the exception handler in the monitor can invoke the exception handler of the appropriate guest. The guest exception handler now believes it received the missing-page exception directly from the processor, and it can take appropriate action.
A reader interested in learning more about this topic might find the readings on virtual machines useful [Suggestions for Further Reading 5.6].
5.1 Chapter 1 discussed four general methods for coping with complexity: modularity, abstraction, hierarchy, and layering.
5.1a Which of those four methods does virtual memory use as its primary organizing scheme?
5.2 Alyssa is trying to organize her notes on virtual memory systems, and it occurred to her that virtual memory systems can usefully be analyzed as naming systems. She went through Chapter 3 and made a list of some technical terms about naming systems; that list is on the right, below. She then listed some mechanisms found in virtual memory systems on the left. But she isn’t sure which naming concept goes with which mechanism. Help Alyssa out by telling her which letters on the right apply to each numbered mechanism on the left.
| 1. page map | a. search path |
| 2. virtual address | b. naming network |
| 3. physical address | c. context reference |
| 4. TLB entry | d. object |
| 5. page-map address register | e. name |
| | f. context |
| | g. none of the above |
5.3 The Modest Mini Corporation’s best-selling computer allows at most two users to run at a time. Its only addressing architecture feature is a single page map, which creates a simple linear address space for the processor. The time-sharing system for this computer loads the page map with a set of memory block addresses before running a user; to switch to the other user, it reloads the entire page map with a new set of memory block addresses. Normally, the set of memory blocks belonging to one user has no overlap with the set of memory blocks belonging to the other user, except that memory block 19 is always assigned as page 3 in every user’s address space, providing a “communication region”.
5.3a Protection and privacy are obviously a problem with a completely public communication area, but is there any other difficulty in using the communication region for any of the following types of data?
A. The character string name of the payroll file
B. An integer representing the number of names in the current payroll file
C. The virtual memory address, within the communication region, of another data item
D. The virtual memory address of a program that lies outside the communication region
E. A small program that is designed to remain within the communication region and execute there
5.3b Ben Bitdiddle has decided that programming with page 3 always preassigned is a nuisance. He has therefore proposed that a call to the system be added that reassigns the communication region to a different page of the calling user’s address space, while not affecting the other users. What effect would this proposal have on your answers to 5.3a?
1980-2-4b
5.4 One advantage of a microkernel over a monolithic kernel is that it reduces the load on the translation look-aside buffer, and thereby increases its hit rate and its consequent effect on performance. True or False? Explain.
1994-1-3a
5.5 Louis writes a multithreaded program, which produces an incorrect answer some of the time, but always completes. He suspects a race condition. Which of the following are strategies that can reduce, and with luck eliminate, race conditions in Louis’s program?
A. Separate a multithreaded program into multiple single-threaded programs, run each thread in its own address space, and share data between them via a communication link that uses SEND and RECEIVE.
C. Ensure that each shared variable v is protected by some lock l_v.
D. Ensure that all locks are acquired in the same order.
2006-1-4
5.6 Which of the following statements about operating system kernels are true?
A. Preemptive scheduling allows the kernel’s thread manager to run applications in a way that helps avoid fate sharing.
B. The kernel serves as a trusted intermediary between programs running on the same computer.
C. In an operating system that provides virtual memory, the kernel must be invoked to resolve every memory reference.
D. When a kernel switches a processor from one application to another, the target application sets the page-map address register appropriately after it is running in user space.
2007-1-4
5.7 Two threads, A and B, execute a procedure named GLORP but always at different times (that is, only one of the threads calls the procedure at a given time). GLORP contains the following code:

    procedure GLORP ()
        ACQUIRE (lock_a)
        ACQUIRE (lock_b)
        …
        RELEASE (lock_b)
        RELEASE (lock_a)
        …
        ACQUIRE (lock_b)
        ACQUIRE (lock_a)
        …
        RELEASE (lock_a)
        RELEASE (lock_b)
5.7a Assuming that no other code in other procedures ever acquires more than one lock at a time, can there be a deadlock? (If yes, give an example; if not, argue why not.)
1995-1-3a
5.7b Now, assuming that the two threads can be in the code fragment above at the same time, can the program deadlock? (If yes, give an example; if not, argue why not.)
1995-1-3b
5.8 Consider three threads, concurrently executing the three programs shown here. The variables x, y, and z are integers with initial value 0.
| Thread 1: | Thread 2: | Thread 3: |
| for i from 1 to 100 | for i from 1 to 100 | for i from 1 to 100 |
| ACQUIRE (A) | ACQUIRE (B) | ACQUIRE (A) |
| ACQUIRE (B) | ACQUIRE (C) | ACQUIRE (C) |
| x ← x + 1 | y ← z + 1 | z ← x + 1 |
| RELEASE (B) | RELEASE (C) | RELEASE (C) |
| RELEASE (A) | RELEASE (B) | RELEASE (A) |
5.8a Can executing these three threads concurrently produce a deadlock? (If yes, give an example; if not, argue why not.)
1993-1-5a
5.8b Does your answer change if the order of the release operations in each thread is reversed? (If they can deadlock, give an example; if not, argue why not.)
1993-1-5b
Additional exercises relating to Chapter 5 can be found in the problem sets beginning on page 425.
* Our colleague Andreas Reuter points out that the possibility that an arbiter may enter a metastable state has been of concern since antiquity: “How long halt ye between two opinions?” (1 Kings 18:21).
* The 16 bytes provide space to save R0, R1, one argument, and a return address.
* In 1982 Intel also introduced the 80186 and 80188, but these 6-MHz processors were used mostly as embedded processors instead of processors for personal computing. One of the major contributions of the 80186 was that it reduced the number of chips required, because it included a DMA controller, an interrupt controller, and a timer.
† Michael D. Schroeder and Jerome H. Saltzer. A hardware architecture for implementing protection rings. Communications of the ACM 15, 3 (March 1972), pages 157–170.
* Source: Intel Web page (http://www.intel.com/pressroom/kits/quickreffam.htm).
† The sum of the number of lines in all C files (source and include) in a kernel release.
6.1 Designing for Performance
6.1.1 Performance Metrics
6.1.2 A Systems Approach to Designing for Performance
6.1.3 Reducing Latency by Exploiting Workload Properties
6.1.4 Reducing Latency using Concurrency
6.1.5 Improving Throughput: Concurrency
6.1.6 Queuing and Overload
6.1.7 Fighting Bottlenecks
6.1.8 An Example: The I/O Bottleneck
6.2 Multilevel Memories
6.2.1 Memory Characterization
6.2.2 Multilevel Memory Management using Virtual Memory
6.2.3 Adding Multilevel Memory Management to a Virtual Memory
6.2.4 Analyzing Multilevel Memory Systems
6.2.5 Locality of Reference and Working Sets
6.2.6 Multilevel Memory Management Policies
6.2.7 Comparative Analysis of Different Policies
6.2.8 Other Page-Removal Algorithms
6.2.9 Other Aspects of Multilevel Memory Management
6.3 Scheduling
6.3.1 Scheduling Resources
6.3.2 Scheduling Metrics
6.3.3 Scheduling Policies
6.3.4 Case Study: Scheduling the Disk Arm
The specification of a computer system typically includes explicit (or implicit) performance goals. For example, the specification may indicate how many concurrent users the system should be able to support. Typically, the simplest design fails to meet these goals because the design has a bottleneck, a stage in the computer system that takes longer to perform its task than any of the other stages. To overcome bottlenecks, the system designer faces the task of creating a design that performs well, yet is simple and modular.
This chapter describes techniques to avoid or hide performance bottlenecks. Section 6.1 presents ways to identify bottlenecks and the general approaches to handle them, including exploiting workload properties, concurrent execution of operations, speculation, and batching. Section 6.2 examines specific versions of the general techniques to attack the common problem of implementing multilevel memory systems efficiently. Section 6.3 presents scheduling algorithms for services to choose which request to process first, if there are several waiting for service.
Performance bottlenecks show up in computer systems for two reasons. First, limits imposed by physics, technology, or economics restrict the rate of improvement in some dimensions of technology, while other dimensions improve rapidly. An obvious class of limits is the physical ones. The speed of light limits how fast signals travel from one end of a chip to the other, how many memory elements can be within a given latency from the processor, and how fast a network message can travel in the Internet. Many other physical limits appear in computer systems, such as power and heat dissipation.
These limits force a designer to make trade-offs. For example, by shrinking a chip, a designer can make the chip faster, but it also reduces the area from which heat can be dissipated. Worse, the power dissipation increases as the designer speeds up the chip. A related trade-off is between the speed of a laptop and its power consumption. A designer wants to minimize a laptop’s power consumption so that the battery lasts longer, yet customers want laptops with fast processors and large, bright screens.
Physical limits are only a subset of the limits a designer faces; there are also algorithmic, reliability, and economic limits. More limits mean more trade-offs and a higher risk of bottlenecks.
The second reason bottlenecks surface in computer systems is that several clients may share a device. If a device is busy serving one client, other clients must wait until the device becomes available. This property forces the system designer to answer questions such as which client should receive the device first. Should the device first perform the request that requires little work, perhaps at the cost of delaying the request that requires a lot of work? The designer would like to devise a scheduling plan that doesn’t starve some clients in favor of others, provides low turnaround time for each individual client request, and has little overhead so that it can serve many clients. As we will see, it is impossible to maximize all of these goals simultaneously, and thus a designer must make trade-offs. Trade-offs may favor one class of requests over another and may result in bottlenecks for the unfavored classes of requests.
Designing for performance creates two major challenges in computer systems. First, one must consider the benefits of optimization in the context of technology improvements. Some bottlenecks are intrinsic ones; they require careful thinking to ensure that the system runs faster than the performance of the slowest stage. Some bottlenecks are technology dependent; time may eliminate these, as technology improves. Unfortunately, it is sometimes difficult to decide whether or not a bottleneck is intrinsic. Not uncommonly, a performance optimization for the next product release is irrelevant by the time the product ships because technology improvements have removed the bottleneck completely. This phenomenon is so common in computer design that it has led to the formulation of the design hint: when in doubt use brute force. Sidebar 6.1 discusses this hint.
Sidebar 6.1 Design Hint: When in Doubt use Brute Force
This chapter describes a few design hints that help a designer resolve trade-offs in the face of limits. These design hints are hints because they often guide the designer in the right direction, but sometimes they don’t. In this book we cover only a few, but the interested reader should digest Hints for computer system design by B. Lampson, which presents many more practical guidelines in the form of hints [Suggestions for Further Reading 1.5.4].
The design hint “when in doubt use brute force” is a direct corollary of the d(technology)/dt curve (see Section 1.4). Given computing technology’s historical rate of improvement, it is typically wiser to choose simple algorithms that are well understood rather than complex, badly characterized algorithms. By the time the complex algorithm is fully understood, implemented, and debugged, new hardware might be able to execute the simple algorithm fast enough. Thompson and Ritchie used a fixed-size table of processes in the UNIX system and searched the table linearly because a table was simple to implement and the number of processes was small. With Joe Condon, Thompson also built the Belle chess machine, which relied mostly on special-purpose hardware to search many positions per second rather than on sophisticated algorithms. Belle won the world computer chess championships several times in the late 1970s and early 1980s and achieved an ELO rating of 2250. (ELO is a numerical rating system used by the World Chess Federation (FIDE) to rank chess players; a rating of 2250 makes one a strong competitive player.) Later, as technology marched on, programs that performed brute-force searching algorithms on an off-the-shelf PC conquered the world computer chess championships. As of August 2005, the Hydra supercomputer (64 PCs, each with a chess coprocessor) was estimated by its creators to have an ELO rating of 3200, which is better than that of the best human player.
A second challenge in designing for performance is maintaining the simplicity of the design. For example, if the design uses different devices with approximately the same high-level function but radically different performance, a challenge is to abstract devices such that they can be used through a simple uniform interface. In this chapter, we see how a clever implementation of the READ and WRITE interface for memory can transparently extend the effective size of RAM to the size of a magnetic disk.
To understand bottlenecks more fully, recall that computer systems are organized in modules to achieve the benefits of modularity and that to process a request, the request may be handed from one module to another. For example, a camera may generate a continuous stream of requests containing video frames and send them to a service that digitizes each frame. The digitizing service in turn may send its output to a file service that stores the frames on a magnetic disk.
By describing this application in a client/service style, we can obtain some insights about important performance metrics. It is immediately clear that in a computer system such as this one, four metrics are of importance: the capacity of the service, its utilization, the time clients must wait for a request to complete, and throughput, the rate at which services can handle requests. We will discuss each metric in turn.
Every service has some capacity, a consistent measure of a service’s size or amount of resources. Utilization is the percentage of capacity of a resource that is used for some given workload of requests. A simple measure of processor capacity is cycles. For example, the processor might be utilized 10% for the duration of some workload, which means that 90% of its processor cycles are unused. For a magnetic disk, the capacity is usually measured in sectors. If a disk is utilized 80%, then 80% of its sectors are used to store data.
In a layered system, each layer may have a different view of the capacity and utilization of the underlying resources. For example, a processor may be 95% utilized but delivering only 70% of its cycles to the application because the operating system uses 25%. Each layer considers what the layers below it do to be overhead in time and space, and what the layers above it do to be useful work. In the processor example, from the application point of view, the 25% of cycles used by the operating system is overhead and the 70% is useful work. In the disk example, if 10% of the disk is used for storing file system data structures, then from the application point of view that 10% used by the file system is overhead and only 90% is useful capacity.
Latency is the delay between a change at the input to a system and the corresponding change at its output. From the client/service perspective, the latency of a request is the time from issuing the request until the time the response is received from the service. This latency has several components: the latency of sending a message to the service, the latency of processing the request, and the latency of sending a response back.
If a task, such as asking a service to perform a request, is a sequence of subtasks, we can think of the complete task as traversing stages of a pipeline, where each stage of the pipeline performs a subtask (see Figure 6.1). In our example, the first stage in the pipeline is sending the request, the second stage is the service digitizing the frame, the third stage is the file service storing the frame, and the final stage is sending a response back to the client.
Figure 6.1 A simple service composed of several stages.
With this pipeline model in mind, it is easy to see that the latency of a pipeline with stages A and B is greater than or equal to the sum of the latencies for each stage in the pipeline:

latency(pipeline) ≥ latency(A) + latency(B)
It is possibly greater because passing a request from one stage to another might add some latency. For example, if the stages correspond to different services, perhaps running on different computers connected by a network, then the overhead of passing requests from one stage to another may add enough latency that it cannot be ignored.
If the stages are of a single service, that additional latency is typically small (e.g., the overhead of invoking a procedure) and can usually be ignored for first-order analysis of performance. Thus, in this case, to predict the latency of a service that isn’t running yet but is expected to perform two functions, A and B, with known latencies, a designer can approximate the joint latency of A and B by adding the latency of A and the latency of B.
Throughput is a measure of the rate of useful work done by a service for some given workload of requests. In the camera example, the throughput we might care about is how many frames per second the system can process because it may determine what quality camera we want to buy.
The throughput of a system with pipelined stages is less than or equal to the minimum of the throughput for each stage:

throughput(pipeline) ≤ min(throughput(stage 1), throughput(stage 2), …, throughput(stage n))
Again, if the stages are of a single service, passing the request from one stage to another usually adds little overhead and has little impact on total throughput. Thus, for first-order analysis that overhead can be ignored, and the relation is usually close to equality.
Consider a computer system with two stages: one that is able to process data at a rate of 1,000 kilobytes per second and a second one at a rate of 100 kilobytes per second. If the fast stage generates one byte of output for each byte of input, the overall throughput must be less than or equal to 100 kilobytes per second. If there is negligible overhead in passing requests between the two stages, then the throughput of the system is equal to the throughput of the bottleneck stage, 100 kilobytes per second. In this case, the utilization of stage 1 is 10% and that of stage 2 is 100%.
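These relations are easy to turn into a back-of-the-envelope model. The short Python fragment below plugs in the numbers of the two-stage example to compute the pipeline throughput and the utilization of each stage, under the same assumption of negligible overhead between stages:

    # Stage throughputs in kilobytes per second, from the example above.
    stage_throughput = {"stage 1": 1000, "stage 2": 100}

    # The pipeline can go no faster than its slowest (bottleneck) stage.
    pipeline_throughput = min(stage_throughput.values())   # 100 kilobytes/second

    for stage, capacity in stage_throughput.items():
        utilization = pipeline_throughput / capacity
        print(stage, "utilization:", format(utilization, ".0%"))  # 10% and 100%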
When a stage processes requests serially, the throughput and the latency of a stage are directly related. The average number of requests a stage handles per unit of time is inversely proportional to the average time to process a single request:

average throughput = 1 / average latency
If all stages process requests serially, the average throughput of the complete pipeline is inversely proportional to the average time a request spends in the pipeline. In these pipelines, reducing latency improves throughput, and the other way around.
When a stage processes requests concurrently, as we will see later in this chapter, there is no direct relationship between latency and throughput. For stages that process requests concurrently, an increase in throughput may not lead to a decrease in latency. A useful analogy is pipes through which water flows with a constant velocity. One can have several parallel pipes (or one fatter pipe), which improves throughput but doesn’t change latency.
To gauge how much improvement we can hope for in reducing a bottleneck, we must identify and determine the performance of the slowest and the next-slowest bottleneck. To improve the throughput of a system in which all stages have equal throughput requires improving all stages. On the other hand, improving the stage that has a throughput that is 10 times lower than any other stage’s throughput may result in a factor of 10 improvement in the throughput of the whole system. We might determine these bottlenecks by measurements or by using simple analytical calculations based on the performance characteristics of each bottleneck. In principle, the performance of any issue in a computer system can be explained, but sometimes it may require substantial digging to find the explanation; see, for example, the study by Perl and Sites on Windows NT’s performance [Suggestions for Further Reading 6.4.1].
One should approach performance optimization from a systems point of view. This observation may sound trivial, but many person-years of work have disappeared in optimizing individual stages that resulted in small overall performance improvements. The reason that engineers are tempted to fine-tune a single stage is that optimizations result in some measurable benefits. An individual engineer can design an optimization (e.g., replacing a slow algorithm with a faster algorithm, removing unnecessary expensive operations, reorganizing the code to have a fast path, etc.), implement it, and measure it, and can usually observe some performance improvement in that stage. This improvement stimulates the design of another optimization, which results in new benefits, and so on. Once one gets into this cycle, it is difficult to keep the law of diminishing returns in mind and realize that further improvements may result in little benefit to the system as a whole.
Since optimizing individual stages typically runs into the law of diminishing returns, an approach that focuses on overall performance is preferred. The iterative approach articulated in Section 1.5.2 achieves this goal because at each iteration the designer must consider whether or not the next iteration is worth performing. If the next iteration identifies a bottleneck that, if removed, shows diminished returns, the designer can stop. If the final performance is good enough, the designer’s job is done. If the final performance doesn’t meet the target, the designer may have to rethink the whole design or revisit the design specification.
The iterative approach for designing for performance has the following steps:
1. Measure the system to find out whether or not a performance enhancement is needed. If performance is a problem, identify which aspect of performance (throughput or latency) is the problem. For multistage pipelines in which stages process requests concurrently, there is no direct relationship between latency and throughput, so improving latency and improving throughput might require different techniques.
2. Measure again, this time to identify the performance bottleneck. The bottleneck may not be in the place the designer expected and may shift from one design iteration to another.
3. Predict the impact of the proposed performance enhancement with a simple back-of-the-envelope model. (We introduce a few simple models in this chapter.) This prediction includes determining where the next bottleneck will be. A quick way to determine the next bottleneck is to unrealistically assume that the planned performance enhancement will remove the current bottleneck and result in a stage with zero latency and infinite throughput. Under this assumption, determine the next bottleneck and calculate its performance. This calculation will result in one of two conclusions:
a. Removing the current bottleneck doesn’t improve system performance significantly. In this case, stop iterating, and reconsider the whole design or revisit the requirements. Perhaps the designer can adjust the interfaces between stages with the goal of tolerating costly operations. We will discuss several approaches in the next sections.
b. Removing the current bottleneck is likely to improve the system performance. In this case, focus attention on the bottleneck stage. Consider brute-force methods of relieving the bottleneck stage (e.g., add more memory). Taking advantage of the d(technology)/dt curve may be less expensive than being clever. If brute-force methods won’t relieve the bottleneck, be smart. For example, try to exploit properties of the workload or find better algorithms.
4. Measure the new implementation to verify that the change has the predicted impact. If not, revisit steps 1–3 and determine what went wrong.
5. Iterate. Repeat steps 1–5 until the performance meets the required level.
The rest of this chapter introduces various systems approaches to reducing latency and increasing throughput, as well as simple performance models to predict the resulting performance.
Reducing latency is difficult because the designer often runs into physical, algorithmic, and economic limits. For example, the latency of sending a message from a client on the east coast of the United States to a service on the west coast is dominated by the speed of light. Looking up an item in a hash table cannot go faster than the best algorithm for implementing hash tables. Building a very large memory that has uniform low latency is economically infeasible.
Once a designer has run into such limits, the common approach is to reduce the latency of some requests, perhaps even at the cost of increasing the latency for other requests. A designer may observe that certain requests are more common than other requests, and use that observation to improve the performance of the frequent operations by splitting the staged pipeline into a fast path for the frequent requests and a slow path for other requests (see Figure 6.2). For example, a service might remember the results of frequently asked requests so that when it receives a repeat of a recently handled request, it can return the remembered result immediately without having to recompute it. In practice, exploiting non-uniformity in applications works so well that it has led to the design hint optimize for the common case (see Sidebar 6.2).
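A minimal sketch of such a fast path, in Python (handle_request is a stand-in for whatever computation the service actually performs):

    remembered = {}    # results of recently handled requests

    def service(request):
        if request in remembered:
            return remembered[request]      # fast path: return the remembered result
        response = handle_request(request)  # slow path: do the real work
        remembered[request] = response
        return response

A real cache would also bound the size of the table and discard old entries; deciding which entries to discard is the same policy question that Section 6.2 examines for multilevel memories.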
Figure 6.2 A simple service with a slow and fast path.
Sidebar 6.2 Design Hint Optimize for the Common Case
A cache (see Section 2.1.1.3) is the most common example of optimizing the path for the most frequent cases. We saw caches in the case study of the Domain Name System (in Section 4.4). As another example, consider a Web browser. Most Web browsers maintain a cache of recently accessed Web pages. This cache is indexed by the name of the Web page (e.g., http://www.Scholarly.edu) and returns the page for that name. If the user asks to view the same page again, then the cache can return the cached copy of the page immediately (a fast path); only the first access requires a trip to the service (a slow path). In addition to improving the user’s interactive experience, the cache helps reduce the load on services and the load on the network. Because caches are so effective, many applications use several of them. For example, in addition to caching Web pages, many Web browsers have a cache that stores the results of looking up names, such as “www.Scholarly.edu”, so that the next request for “www.Scholarly.edu” doesn’t require a DNS lookup.
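To make the fast path concrete, here is a minimal sketch in Python; the names handle_request and compute_result are hypothetical stand-ins, not part of any system described here, and a real browser or DNS cache would also bound the cache size and expire stale entries:

import time

cache = {}   # maps a request name to its remembered result

def compute_result(name):
    time.sleep(0.1)                # stand-in for the slow path (e.g., a trip to the service)
    return "page for " + name

def handle_request(name):
    if name in cache:              # fast path: a repeat of a recent request
        return cache[name]
    result = compute_result(name)  # slow path: first access recomputes
    cache[name] = result
    return result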
The design of multilevel memory in Section 6.2 is another example of how well a designer can exploit non-uniformity in a workload. Because applications have locality of reference, one can build large and fast memory systems out of a combination of a small but fast memory and a large but slow memory.
To evaluate the performance of systems with a fast and slow path, designers typically compute the average latency. If we know the latency of the fast and slow paths, and the frequency with which the system will take the fast path, then the average latency is:
average latency = Frequency_fast × Latency_fast + Frequency_slow × Latency_slow    (6.1)
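As a hypothetical example, suppose the fast path takes 0.1 milliseconds and handles 90 percent of the requests, while the slow path takes 10 milliseconds:

average latency = 0.9 × 0.1 + 0.1 × 10 = 1.09 milliseconds

The 10 percent of requests that take the slow path dominate the average.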
Whether introducing a fast path is worth the effort depends on the relative difference in latency between the fast and slow paths and on the frequency with which the system can use the fast path, which in turn depends on the workload. In addition, one might be able to change the design so that the fast path becomes faster at the cost of a slower slow path. If the frequency of taking the fast path is low, then introducing a fast path (and perhaps optimizing it at the cost of the slow path) is likely not worth the complexity. In practice, as we will see in Section 6.2, many workloads don’t have a uniform distribution of requests, and introducing a fast path works well.
Another way to reduce latency that may require some intellectual effort but that can be effective is to parallelize a stage. We take the processing that a stage must do for a single request and divide that processing up into subtasks that can be performed concurrently. Then, whenever several processors are available they can be assigned to run those subtasks in parallel. The method can be applied either within a multiprocessor system or (if the subtasks aren’t too entangled) with completely separate computers.
If the processing parallelizes perfectly (i.e., each subtask can run without any coordination with other subtasks and each subtask requires the same amount of work), then this plan can, in principle, speed up the processing by a factor n, where n is the number of subtasks executing in parallel. In practice, the speedup is usually less than n because there is overhead in parallelizing a computation—the subtasks need to communicate with each other, for example, to exchange intermediate results; because the subtasks do not require an equal amount of work; because the computation cannot be executed completely in parallel, so some fraction of the computation must be executed sequentially; or because the subtasks interfere with each other (e.g., they contend for a shared resource such as a lock, a shared memory, or a shared communication network).
Consider the processing that a search engine needs to perform in order to respond to a user search query. An early version of Google’s search engine—described in more detail in Suggestions for Further Reading 3.2.4—parallelized this processing as follows. The search engine splits the index of the Web into n pieces, each piece stored on a separate machine. When a front end receives a user query, it sends a copy of the query to each of the n machines. Each machine runs the query against its part of the index and sends the results back to the front end. The front end accumulates the results from the n machines, chooses a good order in which to display them, generates a Web page, and sends it to the user. This plan can give good speedup if the index is large and each of the n machines must perform a substantial, similar amount of computation. It is unlikely to achieve a full speedup of a factor n because there is parallelization overhead (to send the query to the n machines, receive n partial results, and merge them); because the amount of work is not balanced perfectly across the n machines and the front end must wait until the slowest responds; and because the work done by the front end in farming out the query and merging the results hasn’t been parallelized.
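The fan-out/merge structure can be sketched with Python threads; partitions, query_partition, and front_end are hypothetical names, and the real system sends the query to n separate machines over a network rather than to threads in one process:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: each partition holds one piece of the Web index.
partitions = [
    ["apple pie", "apple tart"],
    ["banana bread", "apple butter"],
    ["cherry apple jam", "date cake"],
]

def query_partition(partition, term):
    # Each "machine" runs the query against its own part of the index.
    return [doc for doc in partition if term in doc]

def front_end(term):
    # Send a copy of the query to every partition concurrently, then
    # accumulate the partial results and choose a display order.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(query_partition, partitions, [term] * len(partitions)))
    return sorted(doc for part in partial for doc in part)

print(front_end("apple"))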
Although parallelizing can improve performance, several challenges must be overcome. First, many applications are difficult to parallelize. Applications such as search have exploitable parallelism, but other computations don’t split easily into n mostly independent pieces. Second, developing parallel applications is difficult because the programmer must manage the concurrency and coordinate the activities of the different subtasks. As we saw in Chapter 5, it is easy to get this wrong and introduce race conditions and deadlocks. Systems have been developed to make development of parallel applications easier, but they are often limited to a particular domain. The paper by Dean and Ghemawat [Suggestions for Further Reading 6.4.3] provides an example of how the programming and management effort can be minimized for certain stylized applications running in parallel on hundreds of machines. In general, however, programmers must often struggle with threads and locks, or explicit message passing, to obtain concurrency.
Because of these two challenges in parallelizing applications, designers traditionally have preferred to rely on continuous technology improvements to reduce application latency. However, physical and engineering limitations (primarily the problem of heat dissipation) are now leading processor manufacturers away from making processors faster and toward placing several (and soon, probably, several hundred or even several thousand, as some are predicting [Suggestions for Further Reading 1.6.4]) processors on a single chip. This development means that improving performance by using concurrency will inevitably increase in importance.
If the designer cannot reduce the latency of a request because of limits, an alternative approach is to hide the latency of a request by overlapping it with other requests. This approach doesn’t improve the latency of an individual request, but it can improve system throughput. Because hiding latency is often much easier to achieve than improving latency, it has led to the hint: instead of reducing latency, hide it (see Sidebar 6.3). This section discusses how one can introduce concurrency in a multistage pipeline to increase throughput.
Sidebar 6.3 Design Hint Instead of Reducing Latency, Hide It
Latency is often not under the control of the designer but rather is imposed on the designer by physical properties such as the speed of light. Consider sending a message from the east coast of the United States to the west coast at the speed of light. This takes about 20 milliseconds (see Section 7.1 [on-line]); in the same time, a processor can execute millions of instructions. Worse, each new generation of processors gets faster every year, but the speed of light doesn’t improve. As David Clark, a network researcher, put it succinctly: “One cannot bribe God.” The speed of light shows up as an intrinsic barrier in many places of computer design, even when the distances are short. For example, dies are so large that for a signal to travel from one end of a chip to another is a bottleneck that limits the clock speed of a chip.
When a designer is faced with such intrinsic limits, the only option is to design systems that hide latency and try to exploit performance dimensions that do follow d(technology)/dt. For example, transmission rates for data networks have improved dramatically, and so if a designer can organize the system such that communication can be overlapped with useful computation and many network requests can be batched into a large request, then the large request can be transferred efficiently. Many Web browsers use this strategy: while a large transfer runs in the background, users can continue browsing Web pages, hiding the latency of the transfer.
To overlap requests, we give each stage in the pipeline its own thread of computation so that it can compute concurrently, operating much like an assembly line (see Figure 6.3). If a stage has completed its task and has handed off the request to the next stage, then the stage can start processing the second request while the next stage processes the first request. In this fashion, the pipeline can work on several requests concurrently.
Figure 6.3 A simple service composed of several stages, with each stage operating concurrently using threads.
An implementation of this approach has two challenges. First, some stages of the pipeline may operate more slowly than other stages. As a result, one stage might not be able to hand off the request to the next stage because that next stage is still working on a previous request. A queue of requests may then build up, while other stages sit idle. To ensure that a queue between two stages doesn’t grow without bound, the stages are often coupled using a bounded buffer. We will discuss queuing in more detail in Section 6.1.6.
The second challenge is that several requests must be available. One natural source of multiple requests is if the system has several clients, each generating a request. A single client can also be a source of multiple requests if the client operates asynchronously. When an asynchronous client issues a request, rather than waiting for the response, it continues computing, perhaps issuing more requests. The main challenge in issuing multiple requests asynchronously is that the client must then match the responses with the outstanding requests.
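Both points can be sketched in Python (hypothetical names throughout): each stage gets its own thread, and the stages are coupled by bounded buffers (queue.Queue with maxsize) so that a queue between two stages cannot grow without bound:

import queue, threading

BUFFER_SLOTS = 8   # bound on each inter-stage queue

def stage(inbox, outbox, work):
    while True:
        request = inbox.get()
        if request is None:              # sentinel: shut down, and tell the next stage
            if outbox is not None:
                outbox.put(None)
            return
        result = work(request)
        if outbox is not None:
            outbox.put(result)           # blocks when the next stage falls behind

q12 = queue.Queue(maxsize=BUFFER_SLOTS)  # bounded buffer between stages 1 and 2
q23 = queue.Queue(maxsize=BUFFER_SLOTS)  # bounded buffer between stages 2 and 3

threading.Thread(target=stage, args=(q12, q23, lambda r: r * 2)).start()
threading.Thread(target=stage, args=(q23, None, print)).start()

for request in range(20):                # stage 1: feed requests into the pipeline
    q12.put(request)                     # blocks when q12 is full
q12.put(None)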
Once the system is organized to have many requests in flight concurrently, a designer may be able to improve throughput further by using interleaving. The idea is to make n instances of the bottleneck stage and run those n instances concurrently (see Figure 6.4). Stage 1 feeds the first request to instance 1, the second request to instance 2, and so on. If the throughput of a single instance is t, then the throughput using interleaving is n × t, assuming enough requests are available to run all instances concurrently at full speed and the requests don’t interfere with each other. The cost of interleaving is additional copies of the bottleneck stage.
Figure 6.4 Interleaving requests.
RAID (see Section 2.1.1.4) interleaves several disks to achieve a high aggregate disk throughput. RAID 0 stripes the data across the disks: it stores block 0 on disk 0, block 1 on disk 1, and so on. If requests arrive for blocks on different disks, the RAID controller can serve those requests concurrently, improving throughput. In a similar style one can interleave memory chips to improve throughput. If the current instruction is stored in memory chip 0 and the next one is in memory chip 1, the processor can retrieve them concurrently. The cost of this design is the additional disks and memory chips, but often systems already have several memory chips or disks, in which case the added cost of interleaving can be small in comparison with the performance benefit.
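The striping rule itself is simple arithmetic; a one-function sketch (locate is a hypothetical name):

def locate(block, n_disks):
    # RAID 0 striping: block 0 on disk 0, block 1 on disk 1, and so on;
    # consecutive blocks land on different disks and can be served concurrently.
    return (block % n_disks, block // n_disks)   # (disk number, block within disk)

assert locate(5, 4) == (1, 1)   # with 4 disks, block 5 is the second block on disk 1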
If a stage in Figure 6.3 operates at its capacity (e.g., all physical processors are running threads), then a new request must wait until the stage becomes available; a queue of requests builds up waiting for the busy stage, while other stages may run idle. For example, the thread manager of Section 5.5 maintains a table of threads, which records whether a thread is runnable; a runnable thread must wait until a processor is available to run it. The stage that runs with an input queue while other stages are running idle is a bottleneck.
Using queuing theory* we can estimate the time that a request spends waiting in a queue for its turn to be processed (e.g., the time a thread spends in the ready queue). In queuing theory, the time that it takes to process a request (e.g., the time from when a thread starts running on the processor until it yields) is called the service time. The simplest queuing theory model assumes that requests (e.g., a thread entering the ready queue) arrive according to a random, memoryless process and have independent, exponentially distributed service times. In that case, a well-known queuing theory result tells us that the average queuing delay, measured in units of the average service time and including the service time of this request, will be 1/(1−ρ), where ρ is the service utilization. Thus, as the utilization approaches 1, the queuing delay will grow without bound.
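For example, at a utilization of ρ = 0.5 the average queuing delay is 1/(1 − 0.5) = 2 average service times; at ρ = 0.9 it is 10; and at ρ = 0.99 it is 100.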
This same phenomenon applies to the delays for threads waiting for a processor and to the delays that customers experience in supermarket checkout lines. Any time the demand for a service comes from many statistically independent sources, there will be fluctuations in the arrival of load and thus in the length of the queue at the bottleneck stage and the time spent waiting for service. The rate of arrival of requests for service is known as the offered load. Whenever the offered load is greater than the capacity of a service for some duration, the service is said to be overloaded for that time period.
In some constrained cases, where the designer can plan the system so that the capacity just matches the offered load of requests, it is possible to calculate the degree of concurrency necessary to achieve high throughput and the maximum length of the queue needed between stages. For example, suppose we have a processor that performs one instruction per nanosecond using a memory that takes 10 nanoseconds to respond. To avoid having the processor wait for the memory, it must make a memory request 10 instructions in advance of the instruction that needs it. If every instruction makes a request of memory, then by the time the memory responds, the processor will have issued 9 more. To avoid being a bottleneck, the memory therefore must be prepared to serve 10 requests concurrently.
If half of the instructions make a request of memory, then on average there will be five outstanding requests. Thus, a memory that can serve five requests concurrently would have enough capacity to keep up. The maximum length of the queue needed for this case depends on the application’s pattern of memory references. For example, if every second instruction makes a memory request, a fixed-size queue of size five is sufficient to ensure that the queue never overflows. If the processor performs five instructions that make memory references followed by five that don’t, then a fixed-size queue of size five will still work, but the queue length will vary and the throughput will be different. If the requests arrive randomly, the queue can grow, in principle, without limit. If we were to use a memory that can handle 10 requests concurrently for this random pattern of memory references, then the memory would be utilized at 50% of capacity, and the average queuing delay would be 1/(1 − 0.5) = 2 average service times. With this configuration, the processor observes latencies for some memory requests of 20 or more instruction cycles, and it runs much slower than the designer expected. This example illustrates that a designer must understand non-uniform patterns in the references to memory and exploit them to achieve good performance.
In many computer systems, the designer cannot plan the offered load that precisely, and thus stages will experience periods of overload. For example, an application may have several threads that become runnable all at the same time and there may not be enough processors available to run them. In such cases, at least occasional overload is inevitable. The significance of overload depends critically on how long it lasts. If the duration is comparable to the service time, then a queue is simply an orderly way to delay some requests for service until a later time when the offered load drops below the capacity of the service. Put another way, a queue handles short bursts of too much demand by time-averaging with adjacent periods when there is excess capacity.
If overload persists over long periods of time, the system designer has only two choices:
1. Increase the capacity of the system. If the system must meet the offered load, one approach is to design a system that has less overhead so that it can perform more useful work or purchase a better computer system with higher capacity. In computer systems, it is typically less expensive to buy the next generation of the computer system that has higher capacity because of technology improvements than trying to squeeze the last ounce out of the implementation through complex algorithms.
2. Shed load. If purchasing a computer system with higher capacity isn’t an option and system performance cannot be improved, the preferred method is to shed load by reducing or limiting the offered load until the load is less than the capacity of the system.
One approach to control the offered load is to use a bounded buffer (see Figure 5.5) between stages. When the bounded buffer ahead of the bottleneck stage is full, then the stage before it must wait until the bounded buffer empties a slot. Because the previous stage is waiting, its bounded buffer may fill up too, which may cause the stage before it to wait, and so on. The bottleneck may be pushed all the way back to the beginning of the pipeline. If this happens, the system cannot accept any more input, and what happens next depends on how the system is used.
If the source of the load needs the results of the output to generate the next request, then the load will be self-managing. This model of use applies to some interactive systems, in which the users cannot type the next command until the previous one finishes. This same idea will be used in Chapter 7 [on-line] in the implementation of self-pacing network protocols.
If the source of the load decides not to make the request at all, then the offered load decreases. If the source, however, simply holds on to the request and resubmits it later, then the offered load doesn’t decrease, but some requests are just deferred, perhaps to a time when the system isn’t overloaded.
A crude approach to limiting a source is to put a quota on how many requests a source may have outstanding. For example, some systems enforce a rule that an application may not create more than some fixed number of active threads at the same time and may not have more than some fixed number of open files. If a source has reached its quota for a given service, the system denies the next request, limiting the offered load on the system.
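A sketch of such a quota in Python (Source, submit, and do_request are hypothetical names): a counting semaphore tracks how many requests a source has outstanding, and a request that would exceed the quota is denied rather than queued:

import threading

QUOTA = 32   # illustrative limit on outstanding requests per source

class Source:
    def __init__(self):
        self.slots = threading.Semaphore(QUOTA)

    def submit(self, do_request):
        if not self.slots.acquire(blocking=False):
            return "denied"         # quota reached: refuse, limiting the offered load
        try:
            return do_request()     # the request runs while holding one slot
        finally:
            self.slots.release()    # completion frees the slot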
An alternative to limiting the offered load is reducing it when a stage becomes overloaded. We will see one example of this approach in Section 6.2. If the address spaces of a number of applications cannot fit in memory, the virtual memory manager can swap out a complete address space of one or more applications so that the remaining applications fit in memory. When the offered load decreases to normal levels, the virtual memory manager can swap in some of the applications that were swapped out.
If the designer cannot remove a bottleneck with the techniques described above, it may be possible instead to fight the bottleneck using one or more of three different techniques: batching, dallying, and speculation.
Batching is performing several requests as a group to avoid the setup overhead of doing them one at a time. Opportunities for batching arise naturally at a bottleneck stage, which may have a queue of requests waiting to be processed. For example, if a stage has several requests to send to the next stage, the stage can combine all of the messages into a single message and send that one message to the next stage. This use of batching divides the overhead of an expensive operation (e.g., sending a message) over the several messages. More generally, batching works well when processing a request has a fixed delay (e.g., transmitting the request) and a variable delay (e.g., performing the operation specified in the request). Without batching, processing n requests takes n × (f + v), where f is the fixed delay and v is the variable delay. With batching, processing n requests takes f + n × v.
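As a hypothetical example, with f = 10 milliseconds and v = 1 millisecond, processing n = 8 requests one at a time takes 8 × (10 + 1) = 88 milliseconds, whereas processing them as a single batch takes 10 + 8 × 1 = 18 milliseconds.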
Once a stage performs batching, the potential arises for additional performance wins. Batching may create opportunities for the stage to avoid work. If two or more write requests in a batch are for the same disk block, then the stage can perform just the last one.
Batching may also provide opportunities to improve latency by reordering the processing of requests. As we will see in Section 6.3.4, if a disk controller receives a batch of requests, it can schedule them in an order that reduces the movement of the disk arm, reducing the total latency for the batch of requests.
Dallying is delaying a request on the chance that the operation won’t be needed, or to create more opportunities for batching. For example, a stage may delay a request that overwrites a disk block in the hope that a second one will come along for the same block. If a second one comes along, the stage can delete the first request and perform just the second one. As applied to writes, this benefit is sometimes called write absorption.
Dallying also increases the opportunities for batching. It purposely increases the latency of some requests in the hope that more requests will come along that can be combined with the delayed requests to form a batch. In this case, dallying increases the latency of some requests to improve the average latency of all requests.
A key design question in dallying is to decide how long to wait. There is no generic answer to this question. The costs and benefits of dallying are application and system specific.
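A sketch of dallying with write absorption (DallyingWriter, flush_batch, and the constants are hypothetical; a real file system would also flush from a timer or a background thread rather than only on the next WRITE):

import time

DALLY_SECONDS = 0.05   # how long to wait: application- and system-specific
MAX_BUFFERED = 64      # flush when this many distinct blocks are pending

class DallyingWriter:
    def __init__(self, flush_batch):
        self.flush_batch = flush_batch   # writes a batch of blocks in one request
        self.pending = {}                # block number -> latest data
        self.oldest = None               # arrival time of the oldest pending write

    def write(self, block, data):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.pending[block] = data       # a rewrite absorbs the earlier write
        if (len(self.pending) >= MAX_BUFFERED or
                time.monotonic() - self.oldest >= DALLY_SECONDS):
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_batch(sorted(self.pending.items()))  # one batched request
            self.pending, self.oldest = {}, None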
Speculation is performing an operation in advance of receiving a request on the chance that it will be requested. The goal is that the results can be delivered with less latency and perhaps with less setup overhead. Speculation can achieve this goal in two different ways. First, speculation can perform operations using otherwise idle resources. In this case, even if the speculation is wrong, performing the additional operations has no downside. Second, speculation can use a busy resource to do an operation that has a long lead time so that the result of the operation can be available without waiting if it turns out to be needed. In this case, speculation might increase the delay and overhead of other requests without benefit because the prediction that the results may be needed might turn out to be wrong.
Speculation may sound bewildering: how can a computer system predict the input of an operation if it hasn’t received the request yet, and how can it predict whether the result of the operation will be useful in the future? Fortunately, many applications have request patterns that a system designer can exploit to predict an input. In some cases, the input value is evident; for example, a future instruction may add register 5 to register 9, and these register values may be available now. In some cases, the input values can be predicted accurately; for example, a program that asks to read byte n is likely to want to read bytes n + 1, n + 2, and so on, too. Similarly, for many applications a system can predict what results will be useful in the future. If a program performs instruction n, it will likely soon need the result of instruction n + 1; only when instruction n is a JMP will the prediction be wrong.
Sometimes a system can use speculation even if the system cannot predict accurately what the input to an operation is or whether the result will be useful. For example, if an input has only two values, then the system might create a new thread and have the main thread run with one input value and the second thread with the other input value. Later, when the system knows the value of the input, it terminates the thread that is computing with the wrong value and undoes any changes that thread might have made. This use of speculation becomes challenging when it involves shared state that is updated by different threads, but using techniques presented in Chapter 9 [on-line] it is possible to undo the operations of a thread, even when shared state is involved.
Speculation creates more opportunities for batching and dallying. If the system speculates that a read request for block n will be followed by read requests for blocks n + 1 through n + 8, then the system can batch those read requests. If a write request might soon be followed by another write request, the system can dally for a while to see if any others come in and, if so, batch all the writes together.
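A sketch of this read-ahead speculation (ReadAheadFile and read_blocks are hypothetical names): a read of block n fetches blocks n through n + 8 in one batched request, on the speculation that a sequential reader will want them next:

READ_AHEAD = 8   # how far to speculate: a design choice, not a universal constant

class ReadAheadFile:
    def __init__(self, read_blocks):
        self.read_blocks = read_blocks   # reads a contiguous range in one request
        self.cache = {}                  # block number -> data fetched speculatively

    def read(self, n):
        if n not in self.cache:
            # Speculate that blocks n+1 .. n+8 will be wanted soon, and
            # batch their reads with the read of block n.
            for i, data in enumerate(self.read_blocks(n, READ_AHEAD + 1)):
                self.cache[n + i] = data
        return self.cache[n]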
Key design questions associated with speculation are when to speculate and how much. Speculation can increase the load on later stages. If this increase in load results in a load higher than the capacity of a later stage, then requests must wait and latency will increase. Also, any work done that turns out to be not useful is overhead, and performing this unnecessary work may slow down other requests. There is no generic answer to this design question; instead, a designer must evaluate the benefits and cost of speculation in the context of the system.
Batching, dallying, and speculation introduce complexity because they introduce concurrency. The designer must coordinate incoming requests with the requests that are batched, dallied, or speculated. Furthermore, if the requested operations share variables, the designer must coordinate the references to these variables. Since coordination is difficult to get right, a designer must use these performance-enhancing techniques with discipline. There is always the risk that by the time the designer has worked out the concurrency problems and the system has made it through the system tests, technology improvements will have made the extra complexity unnecessary. Problem set 14 explores several performance-enhancing techniques and their challenges with a simple multithreaded service.
We illustrate design for performance using batching, dallying, and speculation through a case study involving a magnetic disk such as was described in Sidebar 2.2. The performance problem with disks is that they are made of mechanical components. As a result, reading and writing data to a magnetic disk is slow compared to devices that have no mechanical components, such as RAM chips. The disk is therefore a bottleneck in many applications. This bottleneck is usually referred to as the I/O bottleneck.
Recall from Sidebar 2.2 that the performance of reading and writing a disk block is determined by (1) the time to move the head to the appropriate track (the seek latency); (2) plus the time to wait until the requested sector rotates under the disk head (the rotational latency); (3) plus the time to transfer the data from the disk to the computer (the transfer latency).
The I/O bottleneck is getting worse over time. Seek latency and rotational latency are not improving as fast as processor performance. Thus, from the perspective of programs running on ever faster processors, I/O is getting slower over time. This problem is an example of problems due to incommensurate rates of technology improvement. Following the incommensurate scaling rule of Chapter 1, applications and systems have been redesigned several times over the last few decades to cope with the I/O bottleneck.
To build some intuition for the I/O bottleneck, consider a typical disk of the last decade. The average seek latency (the time to move the head over one-third of the disk) is about 8 milliseconds. The disks spin at 7,200 rotations per minute, which is one rotation every 8.33 milliseconds. On average, the disk has to wait a half rotation for the desired block to be under the disk head; thus, the average rotational latency is 4.17 milliseconds.
Bits read from a disk encounter two potential transfer rate limits, either of which may become the bottleneck. The first limit is mechanical: the rate at which bits spin under the disk heads on their way to a buffer. The second limit is electrical: the rate at which the I/O channel or I/O bus can transfer the contents of the buffer to the computer. A typical modern 400-gigabyte disk has 16,383 cylinders, or about 24 megabytes per cylinder. That disk would probably have 8 two-sided platters and thus 16 read/write heads, so there would be 24/16 = 1.5 megabytes per track. When rotating at 7,200 revolutions per minute (120 revolutions per second), the bits will go by a head at 120 × 1.5 = 180 megabytes per second. The I/O channel speed depends on which standard bus connects the disk to the computer. For the Integrated Device Electronics (IDE) bus, 66 megabytes per second is a common number in practice; for the 3-gigabit-per-second Serial ATA bus, the limit is about 300 megabytes per second. Thus, the IDE bus would be the bottleneck at 66 megabytes per second; with a Serial ATA bus, the disk mechanics would be the bottleneck at 180 megabytes per second.
Using such a disk and I/O standard, reading a 4-kilobyte block chosen at random takes:
average seek time + average rotation latency + transmission of 4 kilobytes
= 8 + 4.17 + (4 / (180 × 1024)) × 1000 milliseconds
= 8 + 4.17 + 0.02 milliseconds
= 12.19 milliseconds
The throughput for reading randomly chosen blocks one by one is:
= 1000/12.19 × 4 kilobytes per second
= 328 kilobytes per second
The main opportunity to handle the I/O bottleneck is to drive the disk at the transfer rate (180 megabytes per second) instead of the rate of seeks and rotations (328 kilobytes per second). This strategy is an example of hiding latency (moving the disk arm) by exploiting throughput (the high transfer rate between computer and disk).
Consider the following prototypical program, which processes a large input file sequentially and produces an output file sequentially:
1 in ← OPEN(“in”, READ) // open “in” for reading
2 out ← OPEN (“out”, WRITE) // open “out” for writing
3
4 while not ENDOFFILE (in) do
5 block ← READ (in, 4096) // read 4 kilobyte block from in
6 block ← COMPUTE (block) // compute for 1 millisecond
7 WRITE (out, block, 4096) // write 4 kilobyte block to out
8 CLOSE (in)
9 CLOSE (out)
If we think of this application as a pipeline, then there are the following stages: (1) the file system, which reads data from a disk in response to a READ (line 5); (2) the application, which computes new data using the data read (line 6); and (3) the file system, which writes the new data to the disk (line 7).
If the application is organized naively, without batching, dallying, and speculation, the average time to go around the loop is equal to the latency of the three stages. The latencies of the two file system stages are dominated by the latency of the disk operations, and thus we can approximate the average latency of the loop as follows:
reading 4 kilobytes + 1 millisecond of computation + writing 4 kilobytes
= 12.19 + 1 + 12.19 milliseconds
= 25.38 milliseconds
In practice, the latency might be lower because this calculation assumes that each disk access involves an average seek time, but if the file system has allocated the blocks near each other on the disk, the disk might have to perform only a short seek.
How can we improve the performance of this program? The program reads the file only once, and thus a cache cannot improve the latency of reading a block. The only alternative is to hide the latency of read and write operations. The simplest optimization is to overlap the reading and writing of blocks with the computation on line 6. Let’s start with reading.
When the application READs a block, the file system can speculate that the application will read a few blocks following the requested block. This speculation can improve performance for our application if we combine it with two further optimizations. First, we modify the file system to lay out the blocks of a file contiguously. Second, we modify the file system to prefetch an entire track of data on each read. Our prototypical application is perfect for prefetching, since the whole data set is read sequentially.
These optimizations eliminate rotational delay before reading can start. An entire track can be read in:
average seek time + 1 rotational delay
= 8 + 8.33 milliseconds
= 16.33 milliseconds
With 1.5 megabytes (1,536 kilobytes) per track, the file system issues one read request per 384 (1536/4) loop iterations, and we have the following timing diagram:
The average time for 384 iterations is:
= reading 1536 kilobytes + 384 × (1 millisecond of computation + writing 4 kilobytes)
= 16.33 + 384 × (1 + 12.19) milliseconds
= 16.33 + 5065 milliseconds
= 5081 milliseconds
Thus, the average time for a loop iteration is 5081/384 = 13.23 milliseconds, a substantial improvement over 25.38 milliseconds.
We can improve the performance of writing blocks by dallying and batching write requests. We modify WRITE to use a buffer of blocks in RAM (see Figure 6.5). The WRITE call stores the updated block into this buffer and returns immediately, and the application thread can continue. When the buffer fills up, the file system can batch the blocks in the buffer and combine them into a single disk request, which the disk can process in parallel with the processor running the application. Batching allows the disk controller to execute writes to adjacent sectors with no rotational delay. Because blocks are written contiguously in our example, the file system may take 384 contiguous writes and batch them together to form a complete track write. These optimizations result in the following timing diagram:
Figure 6.5 Using a buffer to delay writes.
This optimization reduces the average time around the loop:
= (16.33 + 384 + 16.33)/384 milliseconds
= 1.09 milliseconds
If we modify the file system to prefetch the next track before the application calls the 385th READ, we can overlap computation and I/O completely. If we modify the file system to read the next track after it has processed, say, half of the last track read, then we obtain the following timing diagram for each block of 384 loop iterations, other than the first one:
Now the system overlaps computation with I/O completely, the average time around the loop is 1 millisecond, and the application is now bottlenecked by computation, rather than by I/O.
The optimizations take advantage of the facts that the application processes the input file sequentially and that the file system allocates blocks for the output contiguously on disk. However, even for applications that process blocks not in the order in which they are laid out on the disk, these optimizations can be beneficial. The file system, for example, can reorder the disk requests for a batch in the order of their track number, thereby minimizing disk arm movement, and thus improving performance for the whole batch of requests. (To understand what a good algorithm is for disk scheduling, we need to think more broadly about scheduling requests in computer systems, which is the topic of Section 6.3.)
The analysis assumes a simple performance model for the disk; for a more in-depth discussion of the performance of disks, see Suggestions for Further Reading 6.3.1. The analysis also assumes a single disk; using several disks can offer opportunities for improving performance. For example, RAIDs have several disks (see Section 2.1.1.4), which allows the file system to interleave read and write requests instead of serving them one by one, providing additional opportunities for increasing performance. Finally, practical, alternative storage technologies are emerging, which change the trade-offs. For example, designing a high performance storage system with Flash disks provides new opportunities and new challenges (see, for example, Suggestions for Further Reading 6.3.4).
A buffer without write-through can provide substantial performance improvements but can lose on reliability. If the computer system fails before the file system has written out data to the disk, some data is lost. The basic problem is how long to delay before forcing the data to the disk. The longer the file system delays writes, the larger the opportunity for higher performance will be, but the greater the probability that data will be lost if, for example, the power fails and the volatile RAM resets.
There are at least four choices as to when the WRITE request to the disk can be issued:
Before WRITE returns to the caller (write-through).
On an explicit force request from the user (user-controlled write).
When the file is closed (another kind of user-controlled write).
When a certain number of write requests have been accumulated or when some fixed time has passed since the last write request. This option can be a bad idea if one needs to control the order of writes.
A buffer without write-through also introduces some other complexities, mostly related to reliability in the face of system failures. First, if the file system batches several write requests in a single disk request, then the disk may write the blocks in an order different from the order issued by the file system to reduce seek time. Thus, the disk may not reflect a consistent state if the system crashes halfway through the batched write request. Second, the disk controller may also use a buffer without write-through. The file system may think the data has been stored reliably on disk when, in fact, the disk controller is caching it. We shall see systematic ways of controlling the problem caused by caches without write-through in Chapter 9 [on-line]; a nice application of these systematic ways to design a high-performance and robust file system is given by Ganger and Patt [Suggestions for Further Reading 6.3.3]. In general, here we have a good example that increased performance comes at the cost of increased complexity, as illustrated by Figure 1.1.
The prototypical application represents one particular workload for which the techniques described above improve performance well. Improving the performance of the prototypical application is challenging because it doesn’t reuse a block. Many applications read and write a block multiple times, and in that case additional techniques are available to improve performance. In particular, in that case it is worthwhile for the file system to maintain a cache of recently read blocks in RAM. If an application reads a block that is already in the cache, then the file system doesn’t have to perform any disk operations.
Introducing a cache leads to additional coordination constraints. The file system may have to coordinate WRITE and READ operations with outstanding disk requests. For example, a READ operation may force the removal of a modified block from the cache to make space for the block to be read. But the file system cannot throw out a modified block until it has been written to the disk, so the file system must wait until the write request of the modified block has completed before proceeding with the READ operation.
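A sketch of that coordination (BlockCache, disk_read, and disk_write are hypothetical names): the cache keeps a dirty flag per block, and eviction of a modified block completes its write to the disk first:

from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity, disk_read, disk_write):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block number -> (data, dirty); LRU order
        self.disk_read, self.disk_write = disk_read, disk_write

    def _make_room(self):
        while len(self.blocks) >= self.capacity:
            victim, (data, dirty) = self.blocks.popitem(last=False)
            if dirty:
                self.disk_write(victim, data)   # must complete before eviction

    def read(self, n):
        if n not in self.blocks:
            self._make_room()                   # may force out a modified block
            self.blocks[n] = (self.disk_read(n), False)
        self.blocks.move_to_end(n)              # mark most recently used
        return self.blocks[n][0]

    def write(self, n, data):
        if n not in self.blocks:
            self._make_room()
        self.blocks[n] = (data, True)           # dirty until written to the disk
        self.blocks.move_to_end(n)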
Understanding for what workloads a cache works well, learning how to design a cache (e.g., which block to throw out to make space for a new block), and analyzing a cache’s performance benefits are sophisticated topics, which we discuss next. Problem set 16 explores these issues, as well as topics related to scheduling, in the context of a simple high-performance video server.
The previous section described how to address the I/O bottleneck by using two types of digital memory devices: a RAM chip and a magnetic disk, which have different capacities, costs, and speeds. A system designer would like to have a single memory device that is both as large and as fast as the application requires, and that at the same time is affordable. Unfortunately, application requirements often exceed one or another of these three parameters—a memory device that is both fast enough and large enough is usually too expensive—so the designer must make some trade-offs. The usual trade-off is to use more than one memory device, for example, one that is fast but expensive (and thus necessarily too small), and another that is large and cheap (but slower than desired). But fitting an application into such an environment adds the complexity of deciding which parts of the application should use the small, fast memory and which parts the large, slow one. It may also increase maintenance effort if the memory configuration changes.
One might think that improvements in technology may eventually make a brute-force solution economical—someday the designer can just buy a memory that is both large and fast enough. But there are two problems with that thought: one practical and one intrinsic. The practical problem is that historically the increase in memory size has been matched by an equal increase in problem sizes. That is, the data that people want to manipulate has grown along with memory technology.
The intrinsic problem is that memory has a trade-off between latency and size. This trade-off becomes clear when we consider the underlying physics. Even if one has an unlimited budget to throw at the design problem, the speed of light interferes. To see why, imagine a processor that occupies a single point in space, with memory clustered around it in a sphere, using the densest packing that physics allows. With this packing, some of the memory cells will end up located quite near the processor, so the latency (that is, the time required for access to those cells, which requires a propagation of a signal at the speed of light from the processor to the bit and back) will be short. But because only a few memory cells can fit in the space near the processor, most memory cells will be farther away, and they will necessarily have a larger latency. Put another way, for any specified minimum latency requirement, there will be some memory size for which at least some cells must exceed that latency, based on speed-of-light considerations alone. Moreover, the geometry of spheres (the volume of a shell of radius r grows with the square of r) dictates that there must be more high-latency cells than low-latency ones.
In practical engineering terms, available technologies also exhibit analogous packing problems. For example, the latency of a memory array on the same chip as the processor (where it is usually called an L1 cache) is less than the latency of a separate memory chip (which is usually called an L2 cache), which in turn is less than the latency of a much larger memory implemented as a collection of memory chips on a separate card. The result is that the designer is usually forced to deal with a composite memory system in which different component memories have different parameters of latency, capacity, and cost. The challenge then becomes that of achieving overall maximum performance by deciding which data items to store in the fastest memory device, which can be relegated to the slower devices, and deciding if and when to move data items from one memory device to another.
Different memory devices are characterized not just by dimensions of capacity, latency, and cost, but also by cell size and throughput. In more detail, these dimensions are:
Capacity, measured in bits or bytes. For example, a RAM chip may have a capacity from a few to tens of megabytes, whereas magnetic disks have capacities measured in scores or hundreds of gigabytes.
Average random latency, measured in seconds or processor clock cycles, for a memory cell chosen at random. For example, the average latency of RAM is measured in nanoseconds, which might correspond to hundreds of processor clock cycles. (On closer examination, RAM READ latency is actually more complicated—see Sidebar 6.4.) Magnetic disks have an average latency measured in milliseconds, which corresponds to millions of processor clock cycles. In addition, magnetic disks, because of their mechanical components, usually have a much wider variance in their latency than does RAM.
Sidebar 6.4 RAM Latency
Performance analysis sometimes requires a better model of random access memory latency for READ operations. Most random access memory devices actually have two latency parameters of interest: cycle time and access time. The distinction arises because the physical memory device may need time to recover from one access before it can handle the next one. For example, some memory READ mechanisms are destructive: to READ a bit out, the memory device literally smashes the bit and examines the resulting debris to determine the value that the bit had. Once it determines that value, the memory device writes the bit back so that its value can again be available for future READs. This write-back operation typically cannot be overlapped with an immediately following READ operation. Thus, the cycle time of the memory device is the minimum time that must pass between issuance of one READ request and issuance of the next one. However, the result of the READ may be available for delivery to the processor well before the cycle time is complete. The time from issuance of the READ to delivery of the response to the processor is known as the access time of the memory device. The following figure illustrates these two parameters.
Cost, measured in some currency per storage unit. The cost of RAM is typically measured in cents per megabyte, while the cost of magnetic disks is measured in dollars per gigabyte.
Cell size, measured as the number of bits or bytes transferred in or out of the device by a single READ or WRITE operation. For example, the cell size of RAM is typically a few bytes, perhaps 4, 8, or 16. The cell size of a magnetic disk is typically 512 bytes or more.
Throughput, measured in bits per second. RAM can typically transfer data at rates measured in gigabytes per second, while magnetic disks transfer at the rate of hundreds of megabytes per second.
The differences between RAM and magnetic disks along these dimensions are orders of magnitude in all cases. RAM is typically about five orders of magnitude faster than magnetic disk and two orders of magnitude more expensive. Many, but not all, of these dimensions have been improving rapidly. For example, the capacity of magnetic disks has doubled, and their cost has fallen by a factor of 2, every year for the last two decades, while average latency has improved by only a factor of 2 over those same 20 years. Latency has not improved much because it involves mechanical operations rather than all-electronic ones, as described in Sidebar 2.2. This incommensurate rate of technology improvement makes effective memory management a challenge to implement well.
Because larger latency and larger capacity usually go hand in hand, it is customary and useful to describe the various available memory devices as belonging to different levels, with the fastest, smallest device being at the highest level and slower, larger devices being at lower levels. A memory system constructed of devices from more than one level is called a multilevel memory. Figure 6.6 shows a popular way of depicting the multiple levels, using a pyramid, in which higher levels are narrower, suggesting that their capacity is smaller. The memories in the top of the hierarchy are fast and expensive, and they are therefore small; the memories at the bottom of the hierarchy are slow and inexpensive; and so they can be much bigger. In a modern computer system, an information item can be in the registers of the processor, the L1 cache memory, the L2 cache memory, main memory, a RAM disk cache, on a magnetic disk, or even on another computer that is accessible through a network.
Figure 6.6 A multilevel memory pyramid.
Two quite different ways can be used to manage a multilevel memory. One way is to leave it to each application programmer to decide in which memory to place data items and when to move them. The second way is automatic management: a subsystem independent of any application program observes the pattern of memory references being made by the program. With that pattern in mind, the automatic memory management subsystem decides where to place data items and when to move them from one memory device to another.
Most modern memory management is automatic because (1) there exist automatic algorithms that have good performance on average and (2) automatic memory management relieves the programmer of the need to conform the program to specifics of the memory system such as the capacities of the various levels. Without automatic memory management, the application program explicitly allocates memory space within each level and moves data items from one memory level to another. Such programs become dependent on the particular hardware configuration for which they were written, which makes them difficult to write, to maintain, or to move to a different computer. If someone adds more memory to one of the levels, the program will probably have to be modified to take advantage of it. If some memory is removed, the program may stop working.
As Chapter 2 described, there are two commonly encountered memory interfaces: an interface to small-cell memory to which threads refer using READ and WRITE operations, and an interface to large-cell memory to which threads refer using GET and PUT operations. These two interfaces correspond roughly to the levels of a multilevel memory; higher levels typically have small cells and use the READ/WRITE interface, while lower levels typically have large cells and use the GET/PUT interface.
One opportunity in the design of an automatically managed multilevel memory system is to combine it with a virtual memory manager in such a way that the small-cell READ/WRITE interface appears to the application program to apply to the entire memory system. This creates what is sometimes called a one-level store, an idea first introduced in the Atlas system.* Put another way, this scheme virtualizes the entire memory system around the small-cell READ/WRITE interface, thus hiding from the application programmer the GET/PUT interface as well as the specifics of latency, capacity, cost, cell size, and throughput of the component memory devices. The programmer instead sees a single memory system that appears to have a large capacity, a uniform cell size, a modest average cost per bit, and a latency and throughput that depend on the memory access patterns of the application.
Just as with virtualization of addresses, virtualization of the READ/WRITE memory interface further exploits the design principle decouple modules with indirection. In this case, indirection allows the virtual memory manager to translate any particular virtual address not only to different physical memory addresses at different times but also to addresses in a different memory level. With the support of the virtual memory manager, a multilevel memory manager can then rearrange the data among the memory levels without having to modify any application program. By adding one more feature, the indirection exception, this rearrangement can become completely automatic. An indirection exception is a memory reference exception that indicates that the memory manager cannot translate a particular virtual address. The exception handler examines the virtual address and may bind or rebind that value before resuming the interrupted thread.
With these techniques, the virtual memory manager not only can contain errors and enforce modularity, but it also can help make it appear to the program that there is a single, uniform, large memory. The multilevel memory management feature can be slipped in underneath the application program transparently, which means that the application program does not need to be modified.
Virtualization of widely used interfaces creates an opportunity to transparently add features and thus evolve a system. Since by definition many modules use a widely used interface, the transparent addition of features beneath such an interface can have a wide impact, without having to change the clients of the interface. The memory interface is an example of such a widely used interface. In addition to implementing single-level stores, here are several other ways in which systems designers have used a virtual memory manager with indirection exceptions:
Memory-mapped files. When an application opens a file, the virtual memory manager can map the file into the application’s address space, which allows the application to read and write portions of the file as if they were located in RAM. Memory-mapped files extend the idea of a single-level store to include files.
Copy-on-write. If two threads are working on the same data concurrently, then the data can be stored once in memory by mapping the pages that hold the data with only READ permissions. If one of the threads attempts to write a shared page, the virtual memory hardware will interrupt the processor with a permission exception. The handler can demultiplex this exception as an indirection exception of the type copy-on-write. In response to the indirection exception, the virtual memory manager transparently makes a copy of the page and maps the copy with READ and WRITE permissions in the address space of the thread that wants to write the page. With this technique, only changed pages must be copied. (A small sketch of this demultiplexing appears after this list.)
On-demand zero-filled pages. When an application starts, a large part of its address space must be filled with zeros—for instance, the parts of the address space that aren’t preinitialized with instructions or initial data values. Instead of allocating zero-filled pages in RAM or on disk, the virtual memory manager can map those pages without READ and WRITE permissions. When the application refers to one of those pages, the virtual memory hardware will interrupt the processor with a memory reference exception. The exception handler can demultiplex this exception as an indirection exception of the type zero-fill. In response to this zero-fill exception, the virtual memory manager allocates a page dynamically and fills it with zeros. This technique can save storage in RAM or on disk because the parts of the address space that the application doesn’t use will not take up space.
One zero-filled page. Some designers implement zero-filled pages with a copy-on-write exception. The virtual memory manager allocates just one page filled with zeros and maps that one page in all page-map entries for pages that should contain all zeros, but granting only READ permission. Then, if a thread writes to this read-only zero-filled page, the exception handler will demultiplex this indirection exception as a copy-on-write exception, and the virtual memory manager will make a copy and update that thread’s page table to have WRITE permission for the copy.
Virtual shared memory. Several threads running on different computers can share a single address space. When a thread refers to a page that isn’t in its local RAM, the virtual memory manager can fetch the page over the network from a remote computer’s RAM. The remote virtual memory manager unmaps the page and sends the content of the page back. The Apollo DOMAIN system (mentioned in Suggestions for Further Reading 3.2.1) used this idea to make a collection of distributed computers look like one computer. Li and Hudak used this idea to run parallel applications on a collection of workstations with shared virtual memory [Suggestions for Further Reading 10.1.8].
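As promised under copy-on-write above, here is a minimal, self-contained Python sketch of the demultiplexing idea behind the copy-on-write and zero-fill techniques. It is an illustration only: the page-map layout, the tag names, and handle_write_fault are invented for this sketch and do not correspond to any real kernel's interfaces.

```python
# A toy model of the copy-on-write and zero-fill indirection exceptions.
# Pages are bytearrays; each page-map entry is (page, writable, tag).

PAGE_SIZE = 4096

def handle_write_fault(page_map, page_number):
    page, writable, tag = page_map[page_number]
    if writable:
        return                                        # spurious fault
    if tag == "copy-on-write":
        # Give the writer a private copy; other mappings keep the original.
        page_map[page_number] = (bytearray(page), True, None)
    elif tag == "zero-fill":
        # Allocate and zero the page only now that it is actually written.
        page_map[page_number] = (bytearray(PAGE_SIZE), True, None)
    else:
        raise PermissionError(f"page {page_number} is read-only")

# The "one zero-filled page" technique: one shared all-zero page mapped
# read-only at several page numbers, demultiplexed as copy-on-write.
zero_page = bytearray(PAGE_SIZE)
page_map = {0: (zero_page, False, "copy-on-write"),
            1: (zero_page, False, "copy-on-write")}

handle_write_fault(page_map, 0)           # a write attempt triggers the copy
page_map[0][0][0] = 42                    # the write goes to the private copy
assert zero_page == bytearray(PAGE_SIZE)  # the shared zero page is untouched
```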
The virtual memory design for the Mach operating system [Suggestions for Further Reading 6.1.3] provides an example design that supports many of these features and that is used by some current operating systems.
The remainder of this section focuses on building large virtual memories using automatic multilevel memory management. To do so, a designer must address some challenging problems, but, once it is designed, application programmers do not have to worry about memory management. Except for embedded devices (e.g., a computer acting as the controller of a microwave oven), nearly all modern computer systems use virtual memory to contain errors, enforce modularity, and manage multiple memory levels.
Suppose for the moment that we have two memory devices, one that has a READ/WRITE interface, such as a RAM, and a second that has a GET/PUT interface, such as a magnetic disk. If the processor is already equipped with a virtual memory manager such as the one illustrated in Figure 5.20, it is straightforward to add multilevel memory management to create a one-level store.
The basic idea is that at any instant, only some of the pages listed in the page map are actually in RAM (because the RAM has limited capacity) and the rest are on the disk. To support this idea, we add to each entry of the page map a single bit, called the resident bit, in the column identified as r? in Figure 6.7. If the resident bit of a page is TRUE, that means that the page is in a block of RAM and the physical address in the page map identifies that block. If the resident bit of a page is FALSE, that means that the page is not currently in any block of RAM; it is instead on some block on the disk.
Figure 6.7 Integrating a virtual memory manager with a multilevel memory manager. The virtual memory manager is typically implemented in hardware, while the multilevel memory manager is typically implemented in software as part of the operating system.
In the example, pages 10 and 12 are in RAM, while page 11 is only on the disk. Thus, references to pages 10 and 12 can proceed as usual, but if the program tries to refer to page 11, for example, with a LOAD instruction, the virtual memory manager must take some action because the processor can’t refer to the disk with READ/WRITE operations. The action it takes is to alert the multilevel memory manager that it needs to use the GET/PUT interface of the disk to copy that page from the disk block into some block in the RAM where the processor can directly refer to it. For this purpose, the multilevel memory manager (at least conceptually) maintains a second, parallel map that translates page numbers to disk block addresses. In practice, real implementations may merge the two maps.
The pseudocode of Figure 6.8 (which replaces lines 7–9 of the version of the TRANSLATE procedure of Section 5.4.3.1) illustrates the integration. When a program makes a reference to a virtual memory address, the virtual memory manager invokes TRANSLATE, which (after performing the usual domain and permission checks) looks up the page number in the page map. If the requested address is in a page that is resident in memory, the manager proceeds as it did in Chapter 5, translating the virtual address to a physical address in the RAM. If the page is not resident, the manager signals that the page is missing.
Figure 6.8 Replacement for lines 7–9 of procedure TRANSLATE of Chapter 5, to implement a multilevel memory.
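The pseudocode of Figure 6.8 is not reproduced here, but the resident-bit check it adds to TRANSLATE can be sketched in Python as follows. The two-field page-map entries and the MissingPage exception are assumptions made for illustration; a real manager is implemented in hardware, as discussed next.

```python
# Sketch of the resident-bit check that the multilevel version of TRANSLATE
# performs. Each page-map entry is (resident, block): if resident is True,
# block is a RAM block number; otherwise the page lives only on the disk.

PAGE_SIZE = 4096

class MissingPage(Exception):
    """Stand-in for the missing-page indirection exception."""
    def __init__(self, page):
        self.page = page

def translate(page_map, virtual_address):
    page, offset = divmod(virtual_address, PAGE_SIZE)
    resident, block = page_map[page]          # the r? bit and the block number
    if not resident:
        raise MissingPage(page)               # alert the multilevel manager
    return block * PAGE_SIZE + offset         # ordinary translation

# Pages 10 and 12 resident, page 11 only on disk, as in Figure 6.7:
page_map = {10: (True, 7), 11: (False, None), 12: (True, 3)}
print(translate(page_map, 10 * PAGE_SIZE + 100))   # 7 * 4096 + 100
try:
    translate(page_map, 11 * PAGE_SIZE + 8)
except MissingPage as e:
    print("missing page", e.page)                  # manager must GET page 11
```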
The pseudocode of Figure 6.8 describes the operation of the virtual memory manager as a procedure, but to maintain adequate performance a virtual memory manager is nearly always implemented in hardware because it must translate every virtual address the processor issues. With this page-based design, the virtual memory manager interrupts the processor with an indirection exception that is called a missing-page exception or a page fault.
The exception handler examines the value in the program counter register to determine which instruction caused the missing-page exception, and it then examines that instruction in memory to see what address that instruction issued. Next, it calls SEND (see Figure 5.30) with a request containing the missing page number to the port for the multilevel memory manager. SEND invokes ADVANCE, which wakes up a thread of the multilevel memory manager. Then, the handler invokes AWAIT on behalf of the application program’s thread (i.e., with the stack of the thread that caused the exception). The AWAIT procedure yields the processor.
The multilevel memory manager receives the request and copies pages between RAM blocks and disk blocks as they are needed. For each missing-page exception, the multilevel memory manager first looks up that page in its parallel page map to determine the address of the disk block that holds the page. Next, it locates an unused block in RAM. With these two parameters, it issues a GET for the disk block that holds the page, writing the result into the unused RAM block. The multilevel memory manager then informs the virtual memory manager about the presence of the page in RAM by writing the block number in the virtual memory manager’s page map and changing the resident bit to TRUE, and makes the thread that experienced the missing-page exception runnable by calling ADVANCE.
When that thread next runs, it backs up the program counter found in its return point so that after the return to user mode the application program will reexecute the instruction that encountered the missing page. Since that page is now resident in RAM and the multilevel memory manager has updated the mappings of the virtual memory manager, this time the TRANSLATE function will be able to translate the virtual address to a physical address.
If all blocks in RAM are occupied with pages, the multilevel memory manager must select some page from RAM and remove it to make space for the missing page. The page selected for removal is known colloquially as the victim, and the algorithm that the multilevel memory manager uses to select a victim is called the page-removal policy. A bad choice (for example, systematically selecting for removal the page that will be needed by the next memory access) could cause the multilevel memory system to run at the rate that pages can be retrieved from the disk, rather than the rate that words can be retrieved from RAM. In practice, a selection algorithm that exploits a property of most programs known as locality can allow those programs to run with only occasional missing-page exceptions. The locality property is discussed in Section 6.2.5, and several different page removal policies are discussed in Section 6.2.6.
If the selected page was modified while it was in RAM, the multilevel memory manager must PUT the modified page back to the disk before issuing a GET for the new page. Thus, in the worst case, a missing-page exception results in two accesses to the disk: one to PUT a modified page back to the disk and one to GET the page requested by the missing-page exception handler. In the best case, the page in RAM has not been modified since being read from disk, so it is identical to the disk copy. In this case, the multilevel memory manager can simply adjust the virtual memory page-map entry to show that this page is no longer resident, and the number of disk accesses needed is just the one to GET the missing page. This scheme maintains a copy of every virtual memory page on the disk, whether or not that page is also resident in RAM, so the disk must be larger than the RAM and the effective virtual memory capacity is equal to the space allocated for virtual memory on the disk.
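Gathering the last few paragraphs into one place, here is a sketch (in Python, with invented data structures) of the missing-page service path: pick a free RAM block, evicting a victim and writing it back only if it is dirty, then GET the missing page and mark it resident. The choose_victim argument stands in for the page-removal policy discussed below.

```python
# Toy missing-page service path; data structures are illustrative only.

disk = {0: b"page10", 1: b"page11", 2: b"page12"}    # disk block -> contents

def disk_get(block):
    return disk[block]

def disk_put(block, data):
    disk[block] = data

def service_missing_page(page, ram, disk_map, page_map, choose_victim):
    if None in ram:                                  # a free RAM block exists
        block = ram.index(None)
    else:
        victim = choose_victim(page_map)             # page-removal policy
        block = page_map[victim]["block"]
        if page_map[victim]["dirty"]:
            disk_put(disk_map[victim], ram[block])   # write-back disk access
        page_map[victim] = {"resident": False}
    ram[block] = disk_get(disk_map[page])            # GET the missing page
    page_map[page] = {"resident": True, "block": block, "dirty": False}
    # At this point ADVANCE would make the faulting thread runnable again.

# Demonstration with two RAM blocks, three pages, and a FIFO victim policy:
ram = [None, None]
disk_map = {10: 0, 11: 1, 12: 2}                     # page -> disk block
page_map = {p: {"resident": False} for p in disk_map}
fifo = []
for p in (10, 11, 12):
    service_missing_page(p, ram, disk_map, page_map,
                         choose_victim=lambda pm: fifo.pop(0))
    fifo.append(p)
print(page_map)   # page 10 was evicted; pages 11 and 12 are resident
```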
A concern about this scheme is that it introduces what sometimes are called implicit I/Os. The multilevel memory manager performs I/O operations beyond the ones performed by the application (which are then called explicit I/Os). Given that a disk is often an I/O bottleneck (see Section 6.1.8), these implicit I/Os may risk slowing down the application. Problem set 15 explores some of the issues related to implicit I/Os in the context of a page-based and an object-based single-level store.
To mitigate the I/O bottleneck for missing-page exceptions, a designer can exploit concurrency by implementing the multilevel memory manager with multiple threads. When a missing-page exception occurs, the next available multilevel memory manager thread can start to work on that missing page. The thread begins a GET operation and waits for the GET to complete. Meanwhile, the thread manager can assign the processor to some other thread. When the GET completes, an interrupt notifies the multilevel memory manager thread and it completes processing of the missing-page exception. With this organization, the multilevel memory manager can overlap the handling of a missing-page exception with the computation of other threads, and it can handle multiple missing-page exceptions concurrently.
A quite different, less modular organization is used in many older systems: integrate the multilevel memory manager with the virtual memory manager in the kernel, with the goal of reducing the number of instructions required to handle a missing-page exception, and thus improving performance. Typically, when integrated, the multilevel memory manager runs in the application thread in the kernel, thus reducing the number of threads and avoiding the cost of context switches. Most such systems were designed decades ago when instruction count was a major concern.
Comparing these two organizations, one benefit of the modularity of a separate multilevel memory manager is that several multilevel memory managers can easily coexist. For example, one multilevel memory manager that reads and writes blocks to a magnetic disk to provide applications with the illusion of a large memory may coexist with another multilevel memory manager that provides memory-mapped files. These different multilevel memory managers can be implemented as separate modules, as opposed to being integrated together with the virtual memory manager. Separating the multilevel memory manager from the virtual memory manager is an example of the design hint separate mechanism from policy, discussed in Sidebar 6.5. The Mach virtual memory system is an example of a modern, modular design [Suggestions for Further Reading 6.1.3].
Sidebar 6.5 Design Hint: Separate Mechanism from Policy
If a module needs to make a policy decision, it is better to leave the policy decision to the clients of the module so that they can make a decision that meets their goals. If the interface between the mechanism and policy modules is well defined, then this split allows the policy to be changed without having to change the implementation of the mechanism. For example, one could replace the page-removal policy without having to change the mechanism for handling missing-page exceptions. Furthermore, when porting the missing-page exception mechanism to another processor, the missing-page handler may have to be rewritten, but the policy module may require no modifications.
Of course, if a change in policy requires changes to the interface between the mechanism and policy modules, then both modules must be replaced. Thus, the success of following the hint is limited by how well the interface between the mechanism and policy module is designed. The potential downsides of separating mechanism and policy are a loss in performance due to control transfers between the mechanism and policy module, and increased complexity if flexibility is unneeded. For example, if one policy is always the right one, then separating policy and mechanism may just be unnecessary complexity.
In the case of multilevel memory management, separating the missing-page mechanism from the page replacement policy is mostly for ease of porting because the least recently used page-replacement policy (discussed in Section 6.2.5) works well in practice for most applications.
If the multilevel managers are implemented as separate modules from the virtual memory manager, then the designer has the choice of running the multilevel manager modules in kernel mode or as separate applications in user mode. For the same reasons that many deployed systems are monolithic kernel systems (see Section 5.3.6), designers often choose to run the multilevel manager modules in kernel mode. In a few systems, the multilevel managers run as separate user applications with their own address spaces.
One question that requires some careful thought is what to do if a multilevel memory manager encounters a missing-page exception in its own procedures or data. In principle, there is no problem with recursive missing-page exceptions as long as the recursion bottoms out. To ensure that the recursion does bottom out, it is necessary to make sure that some essential set of pages (for example, the pages containing the instructions and tables of the interrupt handler and the kernel thread manager) is never selected for removal from RAM. The usual method is to add a mark to the page-map entries for those essential pages saying, in effect, “Don’t remove this page.” Pages so marked are commonly said to be wired down.
Multilevel memories are common engineering practice. From the processor perspective, stored instructions and data traverse some pyramid of memory devices such as the one that was illustrated in Figure 6.6. But when analyzing or constructing a multilevel memory, we do so by analyzing each adjacent pair of levels individually as a two-level memory system, and then stacking the several two-level memory systems. (One reason for doing it this way is that it seems to work. Another is that no one has yet figured out a more satisfactory way to analyze or manage a three- or more-level memory as a single system.)
Devices that function as the fast level in a two-level memory system are called primary devices, and devices that function as the slow level are called secondary devices. In virtual memory systems, the primary device is usually some form of RAM; the secondary device can be either a slower RAM or a magnetic disk. Web browsers typically use the local disk as a cache that holds pages of remote Web services. In this case, the primary device is a magnetic disk; the remote service is the secondary device, which may itself use magnetic disks for storage. The multilevel memory management algorithms described in the remainder of this section apply to both of these different configurations, and many others.
A cache and a virtual memory are two similar kinds of multilevel memory managers. They are so similar, in fact, that the only difference between them is in the name space they provide for memory cells:
The user of a cache identifies memory cells using the name space of the secondary memory device.
The user of a virtual memory identifies memory cells using the name space of the primary memory device.
Apart from that difference, designers of virtual memories and caches choose policies for multilevel memory management from the same range of possibilities.
The pyramid of Figure 6.6 is typically implemented with the highest level explicitly managed by the application, a cache design at some levels and a virtual memory design at other levels. For example, a multilevel memory system that includes all six levels of the figure might be organized something like the following:
1. At the highest level, the registers of the processor are the primary device, and the rest of the memory system is the secondary device. The application program (as constructed by the compiler code generator) explicitly loads and stores the registers to and from the rest of the memory system.
2. When the processor issues a READ or WRITE to the rest of the memory system, it provides as an argument a name from the main memory name space, but this name goes to a primary memory device located on the same chip as the processor. Since the name is from the lower level main memory name space, this level of memory is being managed as a cache, commonly known as a “level 1 cache” or “L1 cache”.
3. If the named cell is not found in the level 1 cache, a multilevel memory manager looks in its secondary memory, an off-chip memory device, but again using the name from the main memory name space. The off-chip memory is thus another example of a cache, this one known as a “level 2 cache” or “L2 cache”.
4. The level 2 cache is now the primary device, and if the named memory cell is not found there, the next lower multilevel manager (the one that manages the level 2/main memory pair) looks in its secondary device—the main memory—still using the name from the main memory name space.
5. At the next level, the main memory is the primary device. If an addressed cell is not in main memory, a virtual memory manager invokes the next lower level multilevel memory manager (the one described in Section 6.2.3, that manages movement between main and disk memory) but still using the name from the main memory name space. The multilevel memory manager translates this name to a disk block address.
6. The sequence may continue down another layer; if the disk block is not found on the (primary) local disk, yet another multilevel memory manager may retrieve it from some remote (secondary) system. In some systems, this last memory pair is managed as a cache, and in others as a virtual memory.
It should be apparent that the above example is just one of a vast range of possibilities open to the multilevel memory designer.
It is not obvious that an automatically managed multilevel memory system should perform well. The basic requirement for acceptable performance is that all information items stored in the memory must not have equal frequency of use. If every item is used with equal frequency, then a multilevel memory cannot have good performance, since the overall memory will operate at approximately the speed of the slowest memory component. To illustrate this effect, consider a two-level memory system. The average latency of a two-level memory is:
$$\text{AverageLatency} = R_{hit}\cdot T_{primary} + R_{miss}\cdot T_{secondary} \tag{6.2}$$
The term $R_{hit}$ (known as the hit ratio) is the frequency with which items are found in the primary device, and $R_{miss}$ is $(1 - R_{hit})$. This formula is a direct application of Equation 6.1 (in Section 6.1), which gives the average performance of a system with a fast and slow path. Here the fast path is a reference to the primary device, while the slow path is a reference to the secondary device.
If accesses to every cell of the primary and secondary devices were of equal frequency, then the average latency would be proportional to the number of cells of each device:
$$\text{AverageLatency} = \frac{S_{primary}}{S_{primary}+S_{secondary}}\cdot T_{primary} + \frac{S_{secondary}}{S_{primary}+S_{secondary}}\cdot T_{secondary} \tag{6.3}$$
where $S$ is the capacity of a memory device and $T$ is its average latency. In a multilevel memory, it is typical that $S_{secondary} \gg S_{primary}$ and $T_{secondary} \gg T_{primary}$ (as, for example, with RAM for primary memory and magnetic disk for secondary memory), in which case the first term is much smaller than the second, the coefficient of the second term approaches 1, and $\text{AverageLatency} \approx T_{secondary}$. Thus, if accesses to every cell of primary and secondary are equally likely, a multilevel memory doesn't provide any performance benefit.
On the other hand, if the frequency of use of some stored items is significantly higher than the frequency of use of other stored items, even for a short time, automatically managed multilevel memory becomes feasible. For example, if, somehow, 99% of accesses were directed to the faster memory and only 1% to the slower memory, then the average latency would be:
$$\text{AverageLatency} = 0.99\cdot T_{primary} + 0.01\cdot T_{secondary} \tag{6.4}$$
Thus, if the primary device is an L2 cache with 1 nanosecond latency and the secondary device is main memory with 10 nanoseconds latency, the average latency becomes 0.99 + 0.10 = 1.09 nanoseconds, which makes the composite memory, with a capacity equal to that of the main memory, nearly as fast as the L2 cache. For a second example, if the primary device is main memory with 10 nanoseconds latency and the secondary device is a magnetic disk with an average latency of 10 milliseconds, the average latency of the multilevel memory is $0.99 \times 10\ \text{ns} + 0.01 \times 10\ \text{ms} \approx 100$ microseconds.
That latency is substantially larger than the 10 nanosecond primary memory latency, but it is also much smaller than the 10 millisecond secondary memory latency. In essence, a multilevel memory just exploits the design hint optimize for the common case.
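Both examples follow directly from Equation 6.4; here is a two-line check in Python, using the latency values quoted in the text:

```python
# Check the two worked examples against Equation 6.4.
def average_latency(r_hit, t_primary, t_secondary):
    return r_hit * t_primary + (1 - r_hit) * t_secondary

# L2 cache (1 ns) over main memory (10 ns), 99% hit ratio:
print(average_latency(0.99, 1e-9, 10e-9))    # ~1.09e-09 seconds
# Main memory (10 ns) over magnetic disk (10 ms), 99% hit ratio:
print(average_latency(0.99, 10e-9, 10e-3))   # ~1e-04 s (about 100 microseconds)
```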
Most applications are not so well behaved that one can identify a static set of information that is both small enough to fit in the primary device and for which reference is so concentrated that it is the target of 99% of all memory references. However, in many situations most memory references are to a small set of addresses for significant periods of time. As the application progresses, the area of concentration of access shifts, but its size still typically remains small. This concentration of access into a small but shifting locality is what makes an automatically managed multilevel memory system feasible. An application that exhibits such a concentration of accesses is said to have locality of reference.
Analyzing the situation, we can think of a running application as generating a stream of virtual addresses, known as the reference string. A reference string can exhibit locality of reference in two ways:
Temporal locality: the reference string contains several closely spaced references to the same address.
Spatial locality: the reference string contains several closely spaced references to adjacent addresses.
An automatically managed multilevel memory system can exploit temporal locality by keeping in the primary device those memory cells that appeared in the reference string recently—thus applying speculation. It can exploit spatial locality by moving into the primary device memory cells that are adjacent to those that have recently appeared in the reference string—a combination of speculation and batching (because issuing a GET to a secondary device can retrieve a large block of data that can occupy many adjacent memory cells in the primary device).
There are endless ways in which applications exhibit locality of reference:
Programs are written as a sequence of instructions. Most of the time, the next instruction is stored in the memory cell that is physically adjacent to the previous instruction, thus creating spatial locality. In addition, applications frequently execute a loop, which means there will be repeated references to the same instructions, creating temporal locality. Between loops, conditional tests, and jumps, it is common to see many instruction references directed to a small subset of all the instructions of an application for an extended time. In addition, depending on the conditional structure, large parts of an application program may not be exercised at all.
Data structures are typically organized so that a reference to one component of the structure makes references to physically nearby components more likely. Arrays are an example; reference to the first element is likely to be followed shortly by reference to the second. Similarly, if an application retrieves one field of a record, it will likely soon retrieve another field of the same record. Each of these examples creates spatial locality.
Information processing applications typically process files sequentially. For example, a bank audit program may examine accounts one by one in physical storage order (creating spatial locality) and may perform multiple operations on each account (creating temporal locality).
Although most applications naturally exhibit a significant amount of locality of reference, to a certain extent the concept also embodies an element of self-fulfilling prophecy. Application programmers are usually aware that multilevel memory management is widely used, so they try to write programs that exhibit good locality of reference in the expectation of better performance.
If we look at an application that exhibits locality of reference, in a short time the application refers to only a subset of the total collection of memory cells. The set of references of an application in a given interval Δt is called its working set. In one such interval, the application may execute a procedure or loop that operates on a group of related data items, causing most references to go to the text of the procedure and that group of data items. Then, the application might call another procedure, causing most references to go to the text and related data items of that procedure. The working set of an application thus grows, shrinks, and shifts with time.
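The definition is easy to make concrete. Taking the interval Δt to be a window of the last k references, the working set is simply the set of distinct page numbers in that window, as in this small sketch (the reference string here is invented to show the working set shifting):

```python
# Working set of a reference string: the distinct pages referenced in the
# most recent window of k references (the interval "delta t" in the text).

def working_sets(reference_string, k):
    """Yield the working set after each reference, over a window of size k."""
    for i in range(1, len(reference_string) + 1):
        yield set(reference_string[max(0, i - k):i])

refs = [0, 1, 0, 1, 2, 3, 2, 3, 7, 7, 7]   # access shifts: {0,1} -> {2,3} -> {7}
for ws in working_sets(refs, 4):
    print(sorted(ws))
```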
If at some instant the current working set of an application is entirely stored in the primary memory device, the application will make no references to the secondary device. On the other hand, if the current working set of an application is larger than the primary device, the application (or at least the multilevel memory manager) will have to make at least some references to the secondary device, and it will therefore run more slowly. An application whose working set is much larger than the primary device is likely to cause repeated movement of data back and forth between the primary and secondary devices, a phenomenon called thrashing. A design goal is to avoid, or at least minimize, thrashing.
Equipped with the concepts of locality of reference and working set, we can now examine the behavior of some common multilevel memory management policies, algorithms that choose which stored objects to place in the primary device, which to place in the secondary device, and when to move a stored object from one device to the other. To make the discussion concrete, we will analyze multilevel memory management policies in the context of a virtual memory system with two levels: RAM (the primary device) and a magnetic disk (the secondary device), in which the stored objects are pages of uniform size. However, it is important to keep in mind that the same analysis applies to any multilevel memory system, whether organized as a cache or a virtual memory, with uniform or variable-sized objects, and any variety of primary and secondary devices.
Each level of a multilevel memory system can be characterized by four items:
The string of references directed to that level. In a virtual memory system, the reference string seen by the primary device is the sequence of page numbers extracted from virtual addresses of both instructions and data, in the order that the application makes references to them. The reference string seen by the secondary device is the sequence of page numbers that were misses in the primary device. The secondary device reference string is thus a shortened version of the primary device reference string.
The bring-in policy for that level. In a virtual memory system, the usual bring-in policy for the primary device is on-demand: whenever a page is used, bring it to the primary device if it is not already there. The only remaining policy decision is whether or not to bring along some adjacent pages. In a two-level memory system there is no need for a bring-in policy for the secondary device.
The removal policy for that level. In the primary device of a virtual memory system, this policy chooses a page to evict (the victim) to make room for a new page. Again, in a two-level memory system there is no need for a removal policy for the secondary device.
The capacity of the level. In a virtual memory system, the capacity of the primary level is the number of primary memory blocks, and the capacity of the secondary level is the number of secondary memory blocks. Since the secondary memory normally contains a copy of every page, the capacity of the multilevel memory system is equal to the capacity of the secondary device.
The goal of a multilevel memory system is to have the primary device serve as many references in its reference string as possible, thereby minimizing the number of references in the secondary device reference string. In the example of the multilevel memory manager, this goal means to minimize the number of missing-page exceptions. One might expect that increasing the capacity of the primary device would guarantee a reduction (or at least not an increase) in the number of missing-page exceptions. Surprisingly, this expectation is not always true. As an example, consider the first-in, first-out (FIFO) page-removal policy, in which the page selected for removal is the one that has been in the primary device the longest. (That is, the first page that was brought in will be the first page to be removed. This policy is attractive because it is easy to implement by managing the pages of the primary device as a circular buffer.) If the reference string is 0 1 2 3 0 1 4 0 1 2 3 4, and the primary device starts empty, then a primary device with a capacity of three pages will experience nine missing-page exceptions, while a primary device with a capacity of four pages will experience ten missing-page exceptions, as shown in Tables 6.1 and 6.2:
Table 6.1. FIFO Page-Removal Policy with a Three-Page Primary Device
Table 6.2. FIFO Page-Removal Policy with a Four-Page Primary Device
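These miss counts are easy to verify by simulation. The following sketch reproduces the totals of Tables 6.1 and 6.2 for the FIFO policy: nine misses with three pages and ten with four.

```python
# FIFO page-removal simulation on the example reference string.
from collections import deque

def fifo_misses(reference_string, capacity):
    frames, misses = deque(), 0
    for page in reference_string:
        if page not in frames:
            misses += 1
            if len(frames) == capacity:
                frames.popleft()          # evict the oldest resident page
            frames.append(page)
    return misses

refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(fifo_misses(refs, 3))   # 9
print(fifo_misses(refs, 4))   # 10 -- more memory, more misses
```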
This unexpected increase of missing-page exception numbers with a larger primary device capacity is called Belady’s anomaly, named after the author of the paper that first reported it. Belady’s anomaly is not commonly encountered in practice, but it suggests that when comparing page-removal policies, what appears to be a better policy might actually be worse with a different primary device capacity. As we shall see, one way to simplify analysis is to avoid policies that can exhibit Belady’s anomaly.
The objective of a multilevel memory management policy is to select for removal the page that will minimize the number of missing-page exceptions in the future. If we knew the future reference string, we could look ahead to see which pages are about to be touched. The optimal policy would always choose for removal the page not needed for the longest time. Unfortunately, this policy is unrealizable because it requires predicting the future. However, if we run a program and keep track of its reference string, afterwards we can review that reference string to determine how many missing-page exceptions would have occurred if we had used that optimal policy. That result can then be compared with the policy that was actually used to determine how close it is to the optimal one. This unrealizable policy is known as the optimal (OPT) page-removal policy. Tables 6.3 and 6.4 show the result of the OPT page-removal policy applied to the same reference string as before.
Table 6.3. The OPT Page-Removal Policy with a Three-Page Primary Device
Table 6.4. The OPT Page-Removal Policy with a Four-Page Primary Device
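Because OPT needs only a recorded reference string, it too can be simulated after the fact. In the sketch below, the victim is the resident page whose next use lies furthest in the future; running it on the example string gives seven misses with a three-page primary device and six with a four-page one.

```python
# After-the-fact simulation of the unrealizable OPT policy.

def opt_misses(reference_string, capacity):
    frames, misses = set(), 0
    for i, page in enumerate(reference_string):
        if page in frames:
            continue
        misses += 1
        if len(frames) == capacity:
            future = reference_string[i + 1:]
            def next_use(p):
                # Distance to next use; pages never used again sort last.
                return future.index(p) if p in future else len(future)
            frames.remove(max(frames, key=next_use))
        frames.add(page)
    return misses

refs = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(opt_misses(refs, 3))   # 7
print(opt_misses(refs, 4))   # 6
```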
It is apparent from the number of pages brought in that, at least for this reference string, the OPT policy is better than FIFO. In addition, at least for this reference string, the OPT policy gets better when the primary device capacity is larger.
The design goal thus becomes to devise page-removal algorithms that (1) avoid Belady’s anomaly, (2) have hit ratios not much worse than the optimal policy, and (3) are mechanically easy to implement.
Some easy-to-implement page-removal policies have an average performance on a wide class of applications that is close enough to the optimal policy to be effective. A popular one is the least-recently-used (LRU) page-removal policy. LRU is based on the observation that, more often than not, the recent past is a fairly good predictor of the immediate future. The LRU prediction is that the longer the time since a page has been used, the less likely it will be needed again soon. So LRU selects as its victim the page in the primary device that has not been used for the longest time (that is, the “least-recently-used” page). Let’s see how LRU fares when it tackles our example reference string:
Table 6.5. The LRU Page-Removal Policy with a Three-Page Primary Device
Table 6.6. The LRU Page-Removal Policy with a Four-Page Primary Device
For this reference string, LRU is better than FIFO for a primary memory device of size 4 but not as good as the OPT policy. And for both LRU and the OPT policy the number of page movements is monotonically non-increasing with primary device size; these two algorithms avoid Belady’s anomaly, for a non-obvious reason that will be explained in Section 6.2.7.
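Tables such as 6.5 and 6.6 can be checked mechanically. Here is a minimal LRU simulator, again sketched in Python rather than the book’s pseudocode; it counts missing-page exceptions for any reference string and primary device capacity:

```python
from collections import OrderedDict

def lru_misses(reference_string, capacity):
    """Count missing-page exceptions under the LRU page-removal policy."""
    frames = OrderedDict()             # least recently used page at the front
    misses = 0
    for page in reference_string:
        if page in frames:
            frames.move_to_end(page)   # the page is now the most recently used
        else:
            misses += 1
            if len(frames) == capacity:
                frames.popitem(last=False)   # evict the least recently used page
            frames[page] = True
    return misses
```

Running this with capacities 3 and 4 on the section’s reference string would reproduce the counts in Tables 6.5 and 6.6.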
Most useful algorithms require that the new page be the only page that moves in and that only one page move out. Algorithms that have this property are called demand algorithms. FIFO, LRU, and some algorithms that implement the OPT policy are demand algorithms. If any other page moves into primary memory, the algorithm is said to use prepaging, one of the topics of Section 6.2.9.
As seen above, LRU is not as good as the OPT policy. Because it looks at history rather than the future, it sometimes throws out exactly the wrong page (the page movement at reference #11 in the four-page memory provides an example). For a more extreme example, a program that runs from top to bottom through a virtual memory that is larger than the primary device will always evict exactly the wrong page. Consider a primary device with a capacity of four pages that is part of a virtual memory containing five pages, managed with LRU (in the reference-string table for this example, the letter “F” marks a reference that causes a missing-page exception).
If the application repeatedly cycles through the virtual memory from one end to the other, each reference to a page will result in a page movement. If we start with an empty primary device, references to pages 0 through 3 will result in page movements. The reference to page 4 will also result in a page movement, in which LRU will remove page 0, since page 0 has been used least recently. The next reference, to page 0, will also result in a page movement, which leads LRU to remove page 1, since it has been used least recently. As a consequence, the next reference, to page 1, will result in a page movement, replacing page 2, and so on. In short, every access to a page will result in a page movement.
For such an application, a most-recently-used (MRU) page-removal policy would be better. MRU chooses as the victim the most recently used page.
Let’s see how MRU fares on the contrived example that gave LRU so much trouble:
The initial references to pages 0 through 3 result in page movements that fill the empty primary device. The first reference to page 4 will also result in a page movement, replacing page 3, since page 3 has been used most recently. The next reference, to page 0, will not result in a missing-page exception since page 0 is still in the primary device. Similarly, the succeeding references to pages 1 and 2 will not result in page movements. The second reference to page 3 will result in a page movement, replacing page 2, but then there will be three references that do not require page movements. Thus, with the MRU page-removal policy, our contrived application will experience fewer missing-page exceptions than with the LRU page-removal policy: once in steady state, MRU will result in one page movement per loop iteration.
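The steady-state difference is easy to confirm with a simulation. The sketch below uses a single function parameterized by which end of the recency order to evict, and runs four passes over a five-page virtual memory with a four-page primary device; the counts are what the text predicts:

```python
def misses(reference_string, capacity, evict_most_recent):
    """Count missing-page exceptions; frames holds the least recently
    used page first and the most recently used page last."""
    frames, count = [], 0
    for page in reference_string:
        if page in frames:
            frames.remove(page)
        else:
            count += 1
            if len(frames) == capacity:
                victim = -1 if evict_most_recent else 0   # MRU vs. LRU
                frames.pop(victim)
        frames.append(page)            # the page becomes the most recently used
    return count

cyclic = list(range(5)) * 4            # pages 0..4 referenced in a loop, four times
print(misses(cyclic, 4, evict_most_recent=False))  # LRU: all 20 references miss
print(misses(cyclic, 4, evict_most_recent=True))   # MRU: 8 misses, one per loop after startup
```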
In practice, however, LRU is surprisingly robust because past references frequently are a reasonable predictor of future references; examples in which MRU does better are uncommon. A secondary reason why LRU works well is that programmers assume that the multilevel memory system uses LRU or some close approximation as the removal policy and they design their programs to work well under that policy.
Once an overall system architecture that includes a multilevel memory system has been laid out, the designer needs to decide two things that will affect performance: the capacity of the primary device, and the page-removal policy.
These two decisions can be—and in practice often are—supported by an analysis that begins by instrumenting a hardware processor or an emulator of a processor to maintain a trace of the reference string of a running program. After collecting several such traces of typical programs that are to be run on the system under design, these traces can then be used to simulate the operation of a multilevel memory with various primary device sizes and page-removal policies. The usual measure of a multilevel memory’s performance is the hit ratio because it is a pure number whose value depends only on the size of the primary device and the page-removal policy. Given the hit ratio and the latency of the primary and secondary memory devices, one can immediately estimate the performance of the multilevel memory system by using Equation 6.2.
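Equation 6.2 itself is not reproduced in this excerpt, but the estimate it supports is the usual weighted average of the two device latencies. Assuming a hit ratio h and average access times T_primary and T_secondary (symbol names chosen here, not taken from the book):

```latex
% hedged reconstruction of the average-latency estimate
T_{avg} = h \cdot T_{primary} + (1 - h) \cdot T_{secondary}
```

For example, with h = 0.99, T_primary = 100 nanoseconds, and T_secondary = 10 milliseconds, the average latency is roughly 0.01 × 10 ms ≈ 100 microseconds, which shows how strongly the miss ratio dominates.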
In the early 1970s, a team of researchers at the IBM Corporation developed a rapid way of doing such simulations to calculate hit ratios for one class of page-removal policies. If we look more carefully at the “primary device contents” rows of Tables 6.3 and 6.4, we notice that at all times the optimal policy keeps in the three-page memory a subset of the pages that it keeps in the four-page memory. But in FIFO Tables 6.1 and 6.2, at times 8, 9, 11, and 12, this subset property does not hold. This difference is no accident; it is the key to understanding how to avoid Belady’s anomaly and how to rapidly analyze a reference string to see how a particular policy will perform for any primary device size.
If a page-removal policy can somehow maintain this subset property at all times and for every possible primary device capacity, then a larger primary device can never have more missing-page exceptions than a smaller one. Moreover, if we consider a primary device of capacity n pages and a primary device of capacity n + 1 pages, the subset property ensures that the larger primary device contains exactly one page that is not in the smaller primary device. Repeating this argument for every possible primary device size n, we see that the subset property creates a total ordering of all the pages of the multilevel memory system. For example, suppose a memory of size 1 contains page A. A memory of size 2 must also contain page A, plus one other page, perhaps B. A memory of size 3 must then contain pages A and B plus one other page, perhaps C. Thus, the subset property creates the total ordering {A, B, C}. This total ordering is independent of the actual capacity chosen for the primary memory device.
The IBM research team called this ordering a “stack” (in a use of that word that has no connection with push-down stacks), and page-removal policies that maintain the subset property have since become known as stack algorithms. Although requiring the subset property constrains the range of algorithms, there are still several different, interesting, and practical algorithms in the class. In particular, the OPT policy, LRU, and MRU all turn out to be stack algorithms. When a stack algorithm is in use, the virtual memory system keeps just the pages from the front of the ordering in the primary device; it relegates the remaining pages to the secondary device. As a consequence, if m < n, the set of pages in a primary device of capacity m is always a subset of the set of pages in a primary device of capacity n. Thus a larger memory will always be able to satisfy all of the requests that a smaller memory could—and with luck some additional requests. Put another way, the total ordering ensures that if a particular reference hits in a primary memory of size n, it will also hit in every memory larger than n. When a stack algorithm is in use, the hit ratio in the primary device is thus guaranteed to be a non-decreasing function of increasing capacity. Belady’s anomaly cannot arise.
The more interesting feature of the total ordering and the subset property is that for a given page-removal policy an analyst can perform a simulation of all possible primary memory sizes, with a single pass through a given reference string, by computing the total ordering associated with that policy. At each reference, some page moves to the top of the ordering, and the pages that were above it either move down or stay in their same place, as dictated by the page-removal policy. The simulation notes, for each primary memory device size of interest, whether or not these movements within the total ordering also correspond to movements between the primary and secondary memory devices. By counting those movements, when it reaches the end of the reference string the simulation can directly calculate the hit ratio for each potential primary memory size. Table 6.7 shows the result of this kind of simulation for the LRU policy when it runs with the reference string used in the previous examples. In this table, the “size n in/out” rows indicate which pages, if any, the LRU policy will choose to bring into and remove from primary memory in order to satisfy the reference above. Note that at every instant of time, the “stack contents after reference” are in order by time since last usage, which is exactly what intuition predicts for the LRU policy.
Table 6.7. Simulation of the LRU Page-Removal Policy for Several Primary Device Sizes
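For LRU, the single-pass simulation is straightforward to sketch: maintain the total ordering as a list, record the depth (the stack distance) at which each reference hits, and derive the hit ratio for every capacity at the end. The following sketch illustrates the idea; it uses a naive linear search per reference, whereas production tools use cleverer data structures:

```python
def lru_hit_ratios(reference_string, max_capacity):
    """One pass over the reference string yields the LRU hit ratio for
    every primary device capacity from 1 to max_capacity."""
    stack = []                                   # most recently used page first
    hits_at_depth = [0] * (max_capacity + 1)
    for page in reference_string:
        if page in stack:
            depth = stack.index(page) + 1        # position in the total ordering
            if depth <= max_capacity:
                hits_at_depth[depth] += 1        # a hit for every capacity >= depth
            stack.remove(page)
        stack.insert(0, page)                    # the page moves to the front
    total = len(reference_string)
    return [sum(hits_at_depth[1:m + 1]) / total for m in range(1, max_capacity + 1)]

# e.g., lru_hit_ratios([1, 2, 3, 1, 2, 4, 1, 2], 3) returns [0.0, 0.0, 0.5]
```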
In contrast, when analyzing a non-stack algorithm such as FIFO, one would have to perform a complete simulation of the reference string for each different primary device capacity of interest and construct a separate table such as the one above for each memory size. It is instructive to try to create a similar table for FIFO.
In addition, since the reference string is available, its future is known, and the analyst can, with another simulation pass (running backward through the reference string), learn how the optimal page-removal policy would have performed on that same string for every memory size of interest. The analyst can then compare the OPT result with various realizable page-removal candidate policies.
Proof that the optimal page removal policy minimizes page movements, and that it can be implemented as an on-demand stack algorithm, is non-trivial. Table 6.8 illustrates that the statement is correct for the reference string of the previous examples. Sidebar 6.6 provides the intuition of why OPT is a stack algorithm and optimal. The interested reader can find a detailed line of reasoning in the 1970 paper by the IBM researchers [Suggestions for Further Reading 6.1.2] who introduced stack algorithms and explained in depth how to use them in simulations.
Table 6.8. The Optimal Page-Removal Policy for All Primary Memory Sizes
Sidebar 6.6 OPT is a Stack Algorithm and Optimal
To see that OPT is a stack algorithm, consider the following description of OPT, in terms of a total ordering:
1. Start with an empty primary device and an empty set that will become a total ordering. As each successive page is touched, note its depth d in the total ordering (if it is not yet in the ordering, set d to infinity) and move it to the front of the total ordering.
2. Then, move the page that was at the front down in the ordering. Move it down until it follows all pages already in the ordering that will be touched before this page is needed again, or to depth d, whichever comes first. This step requires knowing the future.
3. If d > m (where m is the size of the primary memory device), step 1 will require moving a page from the secondary device to the primary device, and step 2 will require moving a page from the primary device to the secondary device.
The result is that, if the algorithm removes a page from primary memory, it will always choose the page that will not be needed for the longest time in the future. Since the total ordering of all pages is independent of the capacity of the primary device, OPT is a stack algorithm. Therefore, for a particular reference string, the set of pages in a primary device of capacity m is always a subset of the set of pages in a primary device of capacity m + 1. Table 6.8 illustrates this subset property.
Any algorithm based on the LRU policy requires updating recency-of-usage information on every memory reference, whether or not a page moves between the primary and secondary devices. For example, in a virtual memory system every instruction and data reference of the running program causes such an update. But manipulating the representation of this usage information may itself require several memory references, which escalates the cost of the original reference. For this reason, most multilevel memory designers look for algorithms that have approximately the same effect but are less costly to implement. One elegant approximation to LRU is the clock page-removal algorithm.
The clock algorithm is based on a modest hardware extension in which the virtual memory manager (implemented in the hardware of the processor) sets to TRUE a bit, called the referenced bit, in the page table entry for a page whenever the processor makes a reference that uses that page table entry. If at some point in time the multilevel memory manager clears the referenced bit for every page to FALSE, and then the application runs for a while, a survey of the referenced bits will reveal which pages that application used. The clock algorithm consists of a systematic survey of the referenced bits.
Suppose the physical block numbers of the primary device are arranged in numerical order in a ring (i.e., the highest block number is followed by block number 0), as illustrated in Figure 6.9. All referenced bits are initially set to FALSE, and the system begins running. A little later, in Figure 6.9, we find that the pages residing in blocks 0, 1, 2, 4, 6, and 7 have their referenced bits set to TRUE, indicating that some program touched them. Then, some program causes a missing-page exception, and the system invokes the clock algorithm to decide which resident page to evict in order to make room for the missing page. The clock algorithm maintains a pointer much like a clock arm (which is why it is called the clock algorithm). When the virtual memory system needs a free page, the algorithm begins moving the pointer clockwise, surveying the referenced bits as it goes:
Figure 6.9 Example operation of the clock page-removal policy.
1. If the clock arm comes to a block for which the referenced bit is TRUE, the algorithm sets the referenced bit to FALSE and moves the arm ahead to the next block. Thus, the meaning of the referenced bit becomes “The processor has touched the page residing in this block since the last pass of the clock arm.”
2. If the clock arm comes to a block for which the referenced bit is FALSE, that means that the page residing in this block has not been touched since the last pass of the clock arm. This page is thus a good candidate for removal, since it has been used less recently than any page that has its referenced bit set to TRUE. The algorithm chooses this page for eviction and leaves the arm pointing to this block for the next execution of the algorithm.
The clock algorithm thus removes the page residing in the first block that it encounters that has a FALSE referenced bit. If there are no such pages (that is, every block in the primary device has been touched since the previous pass of the clock arm), the clock will move all the way around once, resetting referenced bits as it goes, but at the end of that round it will come again to the first block it examined, which now has a FALSE referenced bit, so it chooses the page in that block. If the clock algorithm were run starting in the state depicted in Figure 6.9, it would choose to remove the page in block 3, since that is the first block in the clockwise direction that has a FALSE referenced bit.
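The survey itself fits in a few lines. Here is a minimal sketch; the eight-block layout and the starting arm position are assumptions chosen to match the description of Figure 6.9:

```python
def clock_choose_victim(referenced, arm):
    """Sweep the clock arm until it finds a FALSE referenced bit and
    return that block; the arm stays there for the next invocation."""
    while referenced[arm]:
        referenced[arm] = False            # clear the bit: one more chance
        arm = (arm + 1) % len(referenced)  # advance the arm to the next block
    return arm

# Blocks 0, 1, 2, 4, 6, and 7 have been touched since the last sweep:
bits = [True, True, True, False, True, False, True, True]
print(clock_choose_victim(bits, 0))        # 3, as the text predicts
```

Note that the loop always terminates: if every bit is TRUE, the first full sweep clears them all, and the arm then stops at the first block it examined.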
The clock algorithm has a number of nice properties. Space overhead is small: just one extra bit per block of the primary device. The extra time spent per page reference is small: forcing a single bit to TRUE. Typically, the clock algorithm has to scan only a small fraction of the primary device blocks to find a page with a FALSE referenced bit. Finally, the algorithm can be run incrementally and speculatively. For example, if the designer of the virtual memory system wants to keep the number of free blocks above some threshold, the memory manager can run the policy ahead of demand, removing pages that haven’t been used recently, and stop moving the arm as soon as it has met the threshold.
The clock algorithm provides only a rough approximation to LRU. Rather than strictly determining which page has been used least recently, it simply divides pages into two categories: (1) those used since the last sweep and (2) those not used since the last sweep. It then chooses as its victim the first page that the arm happens to encounter in the second category. This page has been used less recently than any of the pages in the first category, but is probably not the least-recently-used page. What seems like the worst-case scenario for the clock algorithm would be when all pages have their referenced bit set to TRUE; then the clock algorithm has no information on which to decide which pages have recently been used. On the other hand, if every page in the primary device has been used since the last sweep, there probably isn’t a much better way of choosing a page to remove.
In multilevel memory systems that are completely implemented in hardware, even the clock algorithm may involve too much complexity, so designers resort to yet simpler policies. For example, some processors use a random removal policy for the translation look-aside buffer described in Chapter 5. Random removal can be quite effective in this application because
its implementation requires minimal state.
if the look-aside buffer is large enough to hold the current working set of translations, the chance that a randomly chosen victim turns out to be a translation that is about to be needed is relatively small.
the penalty for removing the wrong translation is also quite small—just one extra reference to a slightly slower random access memory.
Alternatively, some processor cache managers use a completely stateless policy called direct mapping in which the page chosen for eviction is the one located in block n modulo m, where n is the secondary device block number of the missing page and m is the number of blocks in the primary device. If the compiler optimizer is aware that the processor uses a direct mapping policy, and it knows the size of the primary device, it can minimize the number of cache misses by carefully positioning instructions and data in the secondary device.
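The policy is a one-line computation; a sketch with illustrative numbers (the eight-block primary device is an assumption):

```python
def direct_mapped_block(n, m):
    """Primary device block that must hold secondary device block n."""
    return n % m

# With m = 8 primary blocks, secondary blocks 3 and 11 contend for the
# same primary block, so an optimizer avoids placing two frequently used
# items exactly m blocks apart in the secondary device.
print(direct_mapped_block(3, 8), direct_mapped_block(11, 8))   # 3 3
```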
When page-removal policies are implemented in software, designers can use methods that maintain more state. One popular software policy is least-frequently-used, which tracks how often a page is used. Complete coverage of page-removal policies is beyond the scope of this book. The reader is encouraged to explore the large literature on this topic.
Page-removal policies are only one aspect of multilevel memory management. The designer of a multilevel memory manager must also provide a bring-in policy that is appropriate for the system load and, for some systems, may include measures to counter thrashing.
The bring-in policy of all of the paging systems described so far is that pages are moved to the primary device only when the application attempts to use them; such systems are called demand paging systems. The alternative method is known as prepaging. In a prepaging system, the multilevel memory manager makes a prediction about which pages might be needed and brings them in before the application demands them. By moving pages that are likely to be used before they are actually requested, the multilevel memory manager may be able to satisfy a future reference immediately instead of having to wait for the page to be retrieved from a slower memory. For example, when someone launches a new application or restarts one that hasn’t been used for a while, none of its pages may be in the primary device. To avoid the delay that would occur from bringing in a large number of pages one at a time, the multilevel memory manager might choose to prepage as a single batch all of the pages that constitute the program text of the application, or all of the data pages that the application used on a previous execution.
Both demand paging and prepaging make use of speculation to improve performance. Demand paging speculates that the application will touch other bytes on the page just brought in. Prepaging speculates that the application will use the prepaged pages.
A problem that arises in a multiple-application system is that the working sets of the various applications may not all simultaneously fit in the primary device. When that is the case, the multilevel memory manager may have to resort to more drastic measures to avoid thrashing. One such drastic measure is swapping. When an application encounters a long wait, the multilevel memory manager moves all of its pages out of the primary device in a batch. A batch of writes to the disk can usually be scheduled to go faster than a series of single-block writes (Section 6.3.4 discusses this opportunity). In addition, swapping an application completely out immediately provides space for the other applications, so when they encounter a missing-page exception there is no need to wait to move some page out. However, to do swapping, the multilevel memory manager must be able to quickly identify which pages in primary memory are being used by the application being swapped out, and which of those pages are shared with other applications and therefore should not be swapped out.
Swapping is usually combined with prepaging. When a swapped-out application is restarted, the multilevel memory manager prepages the previous working set of that application, in the hope of later reducing the number of missing-page exceptions. This strategy speculates that when the program restarts, it will need the same pages that it was using before it was swapped out.
The trade-offs involved in swapping and prepaging are formidable, and they resist modeling analysis because reasonably accurate models of application program behavior are difficult to obtain. Fortunately, technology improvements have made these techniques less important for a large class of systems. However, they continue to be applicable to specialized systems that require the utmost in performance.
When a stage is temporarily overloaded in Figure 6.3, a queue of requests builds up. An important policy decision is to determine which requests from the queue to perform first. For example, if the disk has a queue of disk requests, in which order should the disk manager schedule them to minimize latency? For another example, should a stage schedule requests in the order they are received? That policy may result in high throughput, but perhaps in high average latency for individual requests because one client’s expensive request may delay several inexpensive requests from other clients. These questions are all examples of the general question of how to schedule resources. This section provides an introduction to systematic answers to this general question. This introduction is sufficient to tackle resource scheduling problems that we encounter in later chapters but scratches only the surface of the literature on scheduling.
Because the technology underlying resources improves rapidly in computer systems, some scheduling decisions become irrelevant over time. For example, in the 1960s and 1970s when several users shared a single computer and the processor was a performance bottleneck, scheduling the processor among users was important. With the arrival of personal computers and the increase in processing power, processor scheduling became mostly irrelevant because it is no longer a performance bottleneck in most situations, and any reasonable policy is good enough. On the other hand, with massive Internet services handling millions of paying customers, the issue of scheduling has increased in importance. The Internet exposes Web sites to extreme variations in load, which can result in more requests than a server can handle at an instant of time, and the service must make a choice in which order to handle the queued requests.
Computer systems make scheduling decisions at different levels of abstraction. At a high level of abstraction, a Web site selling goods might allocate more memory and processor time to a user who always buys goods than to a user who never buys goods but just browses the catalog. At a lower level of abstraction, a bus arbiter must decide to which processor’s memory reference to allocate a shared bus.
Although in these examples allocation decisions are made at different levels of abstraction, the scheduling problem is similar. From the perspective of scheduling, a computer system is a collection of entities that require the use of a set of resources, and scheduling is the set of policies and dispatch mechanisms to allocate resources to entities. Examples of entities include threads, address spaces, users, clients, services, and requests. Examples of resources include processor time, physical memory space, disk space, network capacity, and I/O-bus time. Policies to assign resources to entities include dividing the resources equally among the entities, giving one entity priority over another entity, and providing some minimum guarantee by performing admission control on the number of entities. The scheduler is the component that implements a policy.
Designing the right policy is difficult because there are usually gaps between the high-level goal and the available policy, between the chosen policy and the dispatch mechanism, and between the chosen mechanism and its actual implementation. We discuss each of these challenges in turn.
The desired scheduling policy might incorporate elements of the environment in which the computer system is used but that are difficult to capture in a computer system. For example, how can a Web site identify a high-value customer (that is, one who is likely to make a large purchase)? The high-value user might never have bought before at this site, or it may be difficult to associate an anonymous catalog-browsing request with a particular previous customer. Even if we could identify the request with a particular customer, the request may traverse several modules of the Web site, some of which may have no notion of users. For example, the database that contains information about prices and goods might be unable to prioritize requests from an important customer.
If we can construct the right policy, then there is the challenge of identifying the mechanism to implement the policy. One module might implement a scheduling policy, but because another module is not aware of it, the policy is ineffective. For example, we might desire to give the text editor high priority to provide a good interactive experience to users. We can easily change the thread scheduler to give the thread running the editor higher priority than any other runnable thread. However, how does the bus arbiter, shared file service, or disk scheduler know that a memory, file, or disk request on behalf of the editor should have higher priority than other disk or memory requests? Worse, the disk scheduler is likely to delay operations to batch disk requests to achieve high throughput, but this decision may result in bad interactive performance for the text editor because its requests are delayed.
The final challenge is getting the actual implementation of the mechanism right. Sidebar 6.7 on receive livelock provides an example of how easy it is for two schedulers to interact badly. It illustrates that to design a computer system that doesn’t collapse under overload is a challenge and requires that a designer carefully think through all implementation decisions.
Sidebar 6.7 Receive Livelock
When a system is temporarily overloaded, it is important to have an effective response to the overload situation. The response doesn’t have to be perfect, but it must ensure that the system doesn’t collapse. For example, suppose that a Web news server can handle 1,000 requests per second, but a short time ago there was a big earthquake in Los Angeles and requests are arriving at the rate of 10,000 per second. The goal is to successfully serve (perhaps a random) 10% of the load, but if the designer isn’t careful, the server may end up serving 0%. The problem is called receive livelock, and it can arise if the server spends too much of its time saying “I’m too busy” and as a result never gets a chance to serve any of the requests. Consider a simple interrupt-driven Web service with a bounded buffer.
When a request arrives on the network device, the device generates an interrupt, which causes the interrupt handler to run. The interrupt handler copies the request from the device into a bounded buffer and reenables interrupts so that it can receive the next request. The service has a single thread, which consumes requests from the bounded buffer. When the service is overloaded and requests arrive faster than the service can process them, then the system as described reaches a state where it serves no requests at all because it experiences receive livelock.
Consider what happens when requests arrive much faster than the service can process them. While the service thread is processing a request, the processor receives an interrupt from the network device and the interrupt handler runs. The interrupt handler copies the request into the buffer, notifies the service thread, and returns, reenabling interrupts. As soon as the handler reenables interrupts, the arrival of another request may interrupt the processor again, invoking the interrupt handler. The interrupt handler goes through the same sequence as before until the buffer fills up; then it has no other choice than to discard the request and return from the interrupt, reenabling interrupts. If the network device has another request available, it will interrupt the processor immediately again; the interrupt handler will throw the request away and return. This sequence of events continues indefinitely as long as requests arrive faster than the time for the interrupt handler to run. We have receive livelock: the service never runs, and as a result the number of requests processed by the service per second drops to zero; to users the Web site appears to be down!
The problem here is that the processor’s internal scheduler interacts badly with the thread scheduler. Conceptually, the processor schedules the main thread and the interrupt thread, and the thread manager schedules the main processor thread among the service thread and any other threads. The processor scheduler gives absolute priority to the interrupt thread, scheduling it as soon as an interrupt arrives; the main thread never gets a chance to run the thread manager, and as a result the service thread never receives the processor. This problem occurs when some processing must be performed outside of the interrupt handler. One could contemplate moving all processing into interrupt handlers. This approach has its own problems (as discussed in Section 5.6.4) and negates the modularity advantages of using threads. However, once the problem is stated as a scheduling problem, a solution is available.
The solution [Suggestions for Further Reading 6.4.2] is to modify the scheduling policy so that the service thread gets a chance to run when requests are available in the bounded buffer. This policy can be implemented with a slight modification to the interrupt handler. If the bounded buffer fills up, the interrupt handler should not reenable interrupts as it returns. When the service thread has drained the bounded buffer, say, to only half full, it should reenable interrupts. This policy ensures that the network device doesn’t discard requests unless the buffer is full (i.e., there is an overload situation) and sees that the service thread gets a chance to process requests, avoiding livelock.
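The control flow of that policy can be sketched compactly. In the sketch below, masking of device interrupts is modeled by a flag, and the buffer capacity and half-full threshold are illustrative choices, not values from the paper:

```python
from collections import deque

CAPACITY = 8
buffer = deque()
interrupts_enabled = True

def interrupt_handler(request):
    """Enqueue the request; if the buffer is now full, leave interrupts
    disabled so the service thread can catch up."""
    global interrupts_enabled
    buffer.append(request)
    if len(buffer) == CAPACITY:
        interrupts_enabled = False

def service_one_request(process):
    """Consume one request; reenable interrupts once half drained."""
    global interrupts_enabled
    if buffer:
        process(buffer.popleft())
    if len(buffer) <= CAPACITY // 2:
        interrupts_enabled = True
```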
It is still possible that requests may be discarded. If the network device receives a request but it cannot generate an interrupt, the device has no other choice than to discard the next request. This situation is unavoidable: if the network can generate a higher load than the capacity of the service, the device must shed load. The good news is that under overload the system will at least process some requests rather than none at all.
The list of challenges in designing and implementing schedulers is formidable, but fortunately sophisticated schedulers are often not a requirement for computer systems. Airlines use sophisticated and complex scheduling algorithms because they deal with genuinely expensive and scarce resources (such as airplanes, landing slots, and fuel) and situations in which the peak load can be far larger than usual load (e.g., travel around family holidays). Usually, in a computer system few resources are truly scarce, and simple policies, mechanisms, and implementations suffice.
The rest of Section 6.3 introduces some common goals for a scheduler in a computer system, describes some basic policies to achieve these goals, and presents a case study of scheduling a disk arm. Along the way, the section points out a few scheduling pitfalls, such as receive livelock and priority inversion.
To appreciate possible goals for a scheduler, consider the thread scheduler from the previous chapter. It chooses a thread from a set of runnable threads. In the implementation of the thread manager in Figure 5.24, the scheduler picks the threads in the order in which they appear in the thread table. This scheduling policy is one of many possible policies.
By slightly restructuring the thread scheduler, it could implement different policies easily. A more general implementation of the thread manager would follow the design hint separate mechanism from policy (see Sidebar 6.5). This implementation would separate the dispatch mechanism (the mechanisms for suspending and resuming a thread) from scheduling policy (selecting which thread to run next) by putting them into their own procedures, so that a designer can change the policy without having to change the dispatch mechanism.
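A sketch of what that separation might look like in code (the job objects and their expected_time field are hypothetical, and neither procedure is taken from the book):

```python
def dispatch(ready_list, policy, run):
    """Mechanism: remove the chosen job and run it. Swapping the policy
    function changes the schedule without touching this code."""
    while ready_list:
        job = ready_list.pop(policy(ready_list))
        run(job)

def first_come_first_served(ready_list):
    return 0                           # always the job at the head of the queue

def shortest_job_first(ready_list):
    # assumes each job carries a prediction of its running time
    return min(range(len(ready_list)), key=lambda i: ready_list[i].expected_time)
```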
A designer may want to change the policy because there is no one single best scheduling policy. “Best” might mean different things in different situations. For example, there is tension between achieving good overall efficiency and providing good service to individual requests. From the system’s perspective, the two important measures for “best” are throughput and utilization. With a good scheduler, throughput grows linearly with offered load until throughput hits the capacity of the system. A good scheduler will also ensure that a system doesn’t collapse under overload conditions. Finally, a good scheduler is efficient: it doesn’t consume many resources itself. A scheduler that needs 90% of the processor’s time to do its job is not of much value.
Applications achieve high throughput by being immediately scheduled when a request arrives and processing it to completion, without being rescheduled. For example, any time a thread scheduler starts a thread, but then preempts it to run another thread, it is delaying the preempted thread. Thus, for an application to achieve high throughput, a scheduler must minimize the number of preemptions and the number of scheduling decisions. Unfortunately, this system-level goal may conflict with the needs of individual threads.
Each individual request wants good service, which typically means good response: it starts soon and completes quickly. There are several ways of measuring a request’s response:
Turnaround time. The length of time from when a request arrives at a service until it completes.
Response time. The length of time from when a request arrives at a service until it starts producing output. For interactive requests, this measure is typically more useful than turnaround time. For example, many Web browsers optimize for this metric. Typically, a browser displays an incomplete Web page as soon as the browser receives parts of it (e.g., the text) and fills in the remainder later (e.g., images).
Waiting time. The length of time from when a request arrives at a service until the service starts processing the request. This measure is better than turnaround time, since it captures how long the thread must wait even though it is ready to execute. The ideal waiting time is zero seconds.
More sophisticated measures are also possible by combining the performance of all requests, using some of these measures and some way of combining them. For example, one can compute the average waiting time as the average of the waiting times of all requests. Similarly, one can calculate the sum of the waiting times, the variance in response time, and so on.
In an interactive computer system, many requests are on behalf of a human user sitting in front of a display. Therefore, the perception of the user is another measure of the goodness of the service that a request receives. For example, an interactive user may tend to perceive a high variance in response time to be more annoying than a high mean. On the other hand, a response time that is faster than the human reaction time may not improve the perception of goodness.
Sometimes a designer desires a scheduler that provides some degree of fairness, which means that each request obtains an equal share of the shared service. A scheduler that starves a request to serve other requests is an unfair scheduler. An unfair scheduler is not necessarily a bad scheduler; it may have higher throughput and better response time than a fair scheduler.
It is easy to convince oneself that designing a scheduler that optimizes for fairness, throughput, and response time all at the same time is an impossible task. As a result, there are many different scheduling algorithms; each one of them optimizes along different dimensions.
To illustrate some basic scheduling algorithms, we present a number of them in the context of a thread manager. The objective is to share the processor efficiently among multiple threads. For example, when one thread is blocked waiting for I/O, we would like to run a different, runnable thread on the processor. These threads might be running different programs on a shared computer, or they might cooperate to implement a high-performance Web service on a dedicated computer.
Since threads typically go through a cycle of running and waiting (e.g., waiting for user input, a client request, or completion of disk request), it is useful to model a thread as a series of jobs. Each job corresponds to one burst of activity.
We survey a few different algorithms to schedule these jobs. Many textbooks, lecture notes, and papers explore these algorithms in more detail, and our description is based on this literature. Although the algorithms are described in the context of a thread manager for a single processor, the algorithms are generic and apply to other contexts as well. For example, they work equally well for multiprocessors and have the same pros and cons, but they are harder to illustrate when several jobs run concurrently. The algorithms also apply to disk-arm scheduling, which we shall discuss in Section 6.3.4.
At a busy post office, customers may be asked to take a ticket with a number as they walk in and wait until the number on their ticket is called. Typically, the post office allocates the numbers in strict increasing order and calls the numbers in that order. This policy is called a first-come, first-served (FCFS) scheduler and some thread managers use it too.
A thread manager can implement the first-come, first-served policy by organizing the ready list as a first-in, first-out queue. The manager simply runs the first job on the queue until it finishes; then the manager runs the next job, which is now the first job, and so on. When a job becomes ready, the scheduler simply adds it to the end of the queue.
To illustrate and analyze the behavior of a scheduling policy, the literature uses sequences of job arrivals, in which each job has a specific amount of work. We adopt one particular sequence, which illustrates the differences between the scheduling algorithms that we cover. This sequence is the following:
| Job | Arrival time | Amount of work |
| A | 0 | 3 |
| B | 1 | 5 |
| C | 3 | 2 |
Given a specific sequence, one can draw a timeline that depicts when the thread manager dispatches jobs. For the above sequence and the first-come, first-served policy, the timeline is as follows: job A runs from time 0 through 3, job B from time 3 through 8, and job C from time 8 through 10.
Given this timeline, one can fill out a table that includes finish times and waiting times, and make some observations about a policy. For the above timeline and the first-come, first-served policy this table is as follows:
| Job | Finish time | Waiting time |
| A | 3 | 0 |
| B | 8 | 2 |
| C | 10 | 5 |
From the table we can see that for the given job sequence, the first-come, first-served policy favors the long jobs A and B. Job C waits 5 seconds to start a job that takes 2 seconds. Relative to the amount of work, job C is punished the most.
Because first-come, first-served can favor long jobs over short jobs, a system can get into an undesirable state. Consider what happens if we have a system with one thread that periodically waits for I/O but mostly computes and several threads that perform mostly I/O operations. Suppose the scheduler runs the I/O-bound threads first. They will all quickly finish their jobs and go start their I/O operations, leaving the scheduler to run the processor-bound thread. After a while, the I/O-bound threads will finish their I/O and queue up behind the processor-bound thread, leaving all the I/O devices idle. When the processor-bound thread finishes its job, it initiates its I/O operation, allowing the scheduler to run the I/O-bound threads.
As before, the I/O-bound threads will quickly finish computation and initiate an I/O operation. Now we have the processor sitting idle, while all the threads are waiting for their I/O operations to complete. Since the processor-bound thread started its I/O first, it will likely finish first, grabbing the processor and making all the other threads wait before they can run. The system will continue this way, alternating between periods when the processor is busy and all the I/O devices are idle with periods when the processor is idle and all the threads are doing I/O in a convoy, which is why the literature sometimes refers to this case as a convoy effect. The main opportunity for having threads is missed, since in this convoy scenario the system never overlaps computation with I/O.
This scenario is unlikely to materialize in practice because workloads are unlikely to have exactly the right mix of computing and I/O threads that would produce a sequence of scheduling decisions that lead to a situation where I/O isn’t overlapped at all with computation. Nevertheless, it has inspired researchers to think about policies other than first-come, first-served.
The undesirable scenario with the first-come first-served policy suggests another scheduler: a shortest-job-first scheduler. Whenever the time comes to dispatch a job, the scheduler chooses the job that has the shortest expected running time. Shortest-job-first requires that the scheduler has a prediction of the running time of a job before running it. In the general case, it is difficult to make predictions of the running time of a job, but in practice there are special cases that can work.
Let’s assume we know the running time of a job beforehand and see how a shortest-job-first scheduler performs on the example sequence: job A runs from time 0 through 3, job C from time 3 through 5, and job B from time 5 through 10.
As we can see, job C runs before job B because when the scheduler runs after job A completes, it picks C instead of B, since job C has just entered the system and needs less time than job B. Here is the complete table for the shortest-job-first policy:
| Job | Finish time | Waiting time |
| A | 3 | 0 |
| B | 10 | 4 |
| C | 5 | 0 |
Job B’s waiting time has increased, but relative to the amount of work it has to do, it has to wait less than job C did under the first-come, first-served policy. The total amount of waiting time for the shortest-job-first policy decreased compared to the first-come, first-served policy (4 versus 7).
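Both timelines, and the totals of 7 and 4, can be reproduced with a small non-preemptive scheduler simulation; the sketch below parameterizes the choice of the next job:

```python
def waiting_times(jobs, pick):
    """Simulate a non-preemptive scheduler. jobs is a list of
    (name, arrival_time, work); pick(queue) returns the index to run next."""
    pending = sorted(jobs, key=lambda j: j[1])   # jobs that have not yet arrived
    queue, waits, time = [], {}, 0
    while queue or pending:
        while pending and pending[0][1] <= time:
            queue.append(pending.pop(0))         # arrivals join the queue
        if not queue:
            time = pending[0][1]                 # idle until the next arrival
            continue
        name, arrival, work = queue.pop(pick(queue))
        waits[name] = time - arrival
        time += work
    return waits

jobs = [("A", 0, 3), ("B", 1, 5), ("C", 3, 2)]
fcfs = waiting_times(jobs, lambda q: 0)
sjf = waiting_times(jobs, lambda q: min(range(len(q)), key=lambda i: q[i][2]))
print(fcfs, sum(fcfs.values()))   # {'A': 0, 'B': 2, 'C': 5} 7
print(sjf, sum(sjf.values()))     # {'A': 0, 'C': 0, 'B': 4} 4
```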
The shortest-job-first policy has one implementation challenge: how do we know the amount of work a job has to do? In some cases, we may be able to decide before running the job whether or not this is a short job. For example, if we have two requests for reading different sectors on the disk and the disk arm is close to one of them, then the request that requires moving the arm to the closer track is the shorter job.
If we cannot decide without executing a job whether or not the job is short, we can make some forward progress by assuming that jobs fall in different classes: a thread that is interactive has mostly short jobs, while a thread that is computationally intensive is likely to have mostly long jobs. This suggests that if we track the past behavior of a thread, then we might be able to predict its future behavior. For example, if a thread just completed a short job, we might predict that its next job also will be short. We can make this idea more precise by basing our prediction on all past jobs of a given thread. One way of doing so is using an Exponentially Weighted Moving Average (EWMA) (see Sidebar 7.6 [on-line]). Of course, past behavior may be a weak indicator of future behavior.
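To make the EWMA idea concrete, here is a minimal sketch in Python; the weight alpha and the list of measured job lengths are illustrative assumptions, not values from the text:

```python
# Minimal sketch: predict a thread's next job length with an
# exponentially weighted moving average (EWMA).  Recent jobs count
# more than old ones; alpha controls how quickly old history fades.

def ewma_update(estimate, measurement, alpha=0.5):
    """Fold one measured job length into the running estimate."""
    return alpha * measurement + (1 - alpha) * estimate

# Hypothetical measured job lengths (milliseconds) for one thread.
history = [10, 12, 8, 50, 9]

estimate = history[0]              # start from the first observation
for length in history[1:]:
    estimate = ewma_update(estimate, length)

print("predicted length of next job:", round(estimate, 1))
```

A scheduler could keep one such estimate per thread and treat threads with small estimates as the ones with short jobs.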
A disadvantage of the shortest-job-first policy versus the first-come first-served policy is that shortest-job-first may lead to starvation. Several threads that consist entirely of short jobs and that together present a load large enough to use up the available processors may prevent a long job from ever being run. In practice, as we will see in Sections 6.3.3.4 and 6.3.4, the shortest-job-first policy can be combined with other policies to avoid starvation.
One of the issues with shortest job first is identifying which jobs are short and which are long. One approach is to make all jobs short by breaking long jobs up into a number of smaller jobs using preemptive scheduling. A preemptive scheduling policy stops a job after a certain amount of time so that the scheduler can pick another job, resuming the preempted one at some time later. As we discussed in Chapter 5, preemptive scheduling also has the benefit that it enforces modularity; a programming error cannot cause a job to never release the processor.
A simple preemptive scheduling policy is round-robin scheduling. A round-robin scheduler maintains a queue of runnable jobs as before. It selects the first job from this queue, as in the first-come, first-served policy, but stops the job after some fixed period of time and selects a new job. Some time later the scheduler will select the stopped job again and run it for at most that same fixed period of time, and so on, until the job completes.
Round-robin can be implemented as follows. Before running the job, the round-robin scheduler sets a timer with a fixed time value, called a quantum. When the timer expires, it causes an interrupt and the interrupt handler calls YIELD. This call gives control back to the scheduler, which moves the job to the end of the queue and selects a new job from the front of the queue. The quantum should be long enough that most short jobs complete without being interrupted, and it should be short enough that most long jobs do get interrupted so that short jobs can get to run sooner.
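The following Python sketch simulates this queue discipline one quantum at a time; the job lengths are hypothetical, arrivals are ignored for simplicity, and a real implementation would rely on the timer interrupt and YIELD rather than a loop:

```python
from collections import deque

# Round-robin sketch: run the front job for one quantum; if it is
# not finished, append it to the end of the queue and pick the next.
queue = deque([("A", 3), ("B", 5), ("C", 2)])   # (name, quanta left)

time = 0
while queue:
    name, remaining = queue.popleft()    # select the job at the front
    print(f"t={time}: run {name}")
    remaining -= 1                        # the job runs for one quantum
    if remaining > 0:
        queue.append((name, remaining))   # preempted: back of the queue
    time += 1
```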
Let’s look at how a round-robin scheduler with a quantum of 1 second performs on the example sequence:
At time 0, only A is in the queue of runnable jobs, so the scheduler selects it. At time 1, B is in the queue so the scheduler selects B and appends A to the end of the queue, since it is not done. At time 2, A is at the front, so the scheduler selects A and appends B to the end of the queue. At time 3, the scheduler appends C to the end of the queue after B. Then, the scheduler selects B, since it is at the front of the queue, and appends A after C. At time 4, the scheduler appends B to the end of the queue and selects C to run. At time 5, the scheduler appends C to the end of the queue and selects A. At time 6, A is done, and the scheduler selects B, and so on.
This timeline results in the following table:
As can be seen in this example, compared to first-come, first-served and shortest-job-first, round-robin results in the worst performance to complete an individual job, measured as total time elapsed since start. This is not surprising because a round-robin scheduler forces long jobs to stop after a quantum of time.
Round-robin, however, has the shortest total waiting time because with round-robin jobs start earlier: every job runs no longer than a quantum before it is stopped and the scheduler selects another job.
Round-robin favors jobs that run for less than a quantum at the expense of jobs that are more than a quantum long, since the scheduler will stop a long job after one quantum and run the short one before returning the processor to the long one. Round-robin is found in many computer systems because many computer systems are interactive, have short jobs, and a quick response provides a good user experience.
Some jobs are more important than others. For example, a system thread that performs minor housekeeping chores such as garbage collecting unused temporary files might be given lower priority than a thread that runs a user program. In addition, if a thread has been blocked for a long time, it might be better to give it higher priority than threads that have run recently.
A scheduler can implement such policies using a priority scheduling policy, which assigns each job a priority number. The dispatcher selects the job with the highest priority number. The scheduler must have some rule to break ties, but it doesn't matter much what the rule is, as long as it doesn't consistently favor one job over another.
A scheduler can assign priority numbers in many different ways. The scheduler could use a predefined assignment (e.g., systems jobs have priority 1, and user jobs have priority 0) or the priority could be computed using a policy function provided by the system designer. Or the scheduler could compute priorities dynamically. For example, if a thread has been waiting to run for a long time, the scheduler could temporarily boost the priority number of the thread’s job. This approach can be used, for example, to avoid the starvation problem of the shortest-job-first policy.
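As one illustration of computing priorities dynamically, the sketch below ages waiting jobs so that a long-waiting, low-priority job eventually outranks a nominally higher-priority one; the job set and the boost rate are invented for the example:

```python
import heapq

def effective_priority(base, waited, boost_per_second=0.1):
    # Aging: a job's priority grows the longer it has waited,
    # which prevents starvation of low-priority jobs.
    return base + boost_per_second * waited

now = 100                                   # hypothetical current time
jobs = [("housekeeping", 0, 0),             # (name, base priority, arrival)
        ("user", 1, 95)]

# heapq pops the smallest entry, so negate to get highest-priority-first.
heap = [(-effective_priority(base, now - arrival), name)
        for name, base, arrival in jobs]
heapq.heapify(heap)
print("dispatch:", heapq.heappop(heap)[1])  # housekeeping, thanks to aging
```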
A priority scheduler may be preemptive or non-preemptive. In the preemptive version, when a high-priority job enters while a low-priority job is running, the scheduler may preempt the low-priority job and start the high-priority job immediately. For example, an interrupt may notify a high-priority thread. When the interrupt handler calls NOTIFY, a preemptive thread manager may run the scheduler, which may interrupt some other processor that is running a low-priority job. The non-preemptive version would not do any rescheduling or preemption at interrupt time, so the low-priority job would run to completion; when it calls AWAIT, the scheduler will switch to the newly runnable high-priority job.
As we make schedulers more sophisticated, we have to be on the alert for surprising interactions among different schedulers. For example, if a thread manager that provides priorities isn’t carefully designed, it is possible that the highest priority thread obtains the least amount of processor time. Sidebar 6.8, which explains priority inversion, describes this pitfall.
Sidebar 6.8 Priority Inversion
Priority inversion is a common pitfall in designing a scheduler with priorities. Consider a thread manager that implements a preemptive, priority scheduling policy. Let’s assume we have three threads, T1, T2, and T3, and threads T1 and T3 share a lock l that serializes references to a shared resource. Thread T1 has a low priority (1), thread T2 has a medium priority (2), and thread T3 has a high priority (3).
The following timing diagram shows a sequence of events that causes the high-priority thread T3 to be delayed indefinitely while the medium priority thread T2 receives the processor continuously.
Let’s assume that T2 and T3 are not runnable; for example, they are waiting for an I/O operation to complete. The scheduler will schedule T1, and T1 acquires lock l. Now the I/O operation completes, and the I/O interrupt handler notifies T2 and T3. The scheduler chooses T3 because it has the highest priority. T3 runs for a short time until it tries to acquire lock l, but because T1 already holds that lock, T3 must wait. Because T2 is runnable and has higher priority than T1, the thread scheduler will select T2. T2 can compute indefinitely; when T2’s time quantum runs out, the scheduler will find two threads runnable: T1 and T2. It will select T2 because T2 has a higher priority than T1. As long as T2 doesn’t call wait, T2 will keep the processor. As long as T2 is runnable, the scheduler won’t run T1, and thus T1 will not be able to release the lock and T3, the high priority thread, will wait indefinitely. This undesirable phenomenon is known as priority inversion.
The solution to this specific example is simple. When T3 blocks on acquiring lock l, it should temporarily lend its priority to the holder of the lock (sometimes called priority inheritance)—in this case, T1. With this solution, T1 will run instead of T2, and as soon as T1 releases the lock its priority will return to its normal low value and T3 will run. In essence, this example is one of interacting schedulers. The thread manager schedules the processor and locks schedule references to shared resources. A challenge in designing computer systems is recognizing schedulers and understanding the interactions between them.
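A toy model of the lending step, in Python; this is a sketch of the idea, not the actual thread manager interface:

```python
# Toy model of priority inheritance: a thread that blocks on a lock
# lends its priority to the holder; the holder reverts on release.

class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.base = self.priority = priority

class Lock:
    def __init__(self):
        self.holder = None
    def acquire(self, thread):
        if self.holder is None:
            self.holder = thread          # uncontended: just take it
            return True
        # Contended: lend priority to the holder if ours is higher.
        self.holder.priority = max(self.holder.priority, thread.priority)
        return False                      # caller must wait
    def release(self):
        self.holder.priority = self.holder.base   # revert to normal
        self.holder = None

t1, t3 = Thread("T1", 1), Thread("T3", 3)
l = Lock()
l.acquire(t1)       # T1 holds the lock
l.acquire(t3)       # T3 blocks; T1 now runs at priority 3
print(t1.priority)  # 3: T1 outranks the medium-priority T2 and can finish
l.release()
print(t1.priority)  # 1: back to its normal low value
```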
The problem and solution have been “discovered” by researchers in the real-time system, database, and operating system communities, and are by now well documented. Nevertheless, it is easy to fall into the priority inversion pitfall. For example, in July 1997 the Mars Pathfinder spacecraft experienced total system resets on Mars, which resulted in the loss of collected experimental data. The software engineers traced the cause of the resets to a priority inversion problem*.
* Mike Jones. What really happened on Mars? Risks Forum 19, 49 (December 1997). The Web page http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html includes additional information, including a follow-up by Glenn Reeves, who led the software team for the Mars Pathfinder.
Certain applications have real-time constraints; they require delivery of results before a specified deadline. A chemical process controller, for instance, might have a valve that must be opened every 10 seconds because otherwise a container overflows. Such applications employ real-time schedulers to guarantee that jobs complete by the stated deadline.
For some systems, such as a chemical plant, a nuclear reactor, or a hospital intensive-care unit, missing a deadline might result in disaster. Such systems require a hard real-time scheduler. For these schedulers, designers must carefully determine the amount of resources each job takes and design the complete system to ensure that all jobs can be handled in a timely manner, even in the worst case. Determining the amount of resources necessary and the time that a job takes, however, is difficult. For example, a system with a cache might sometimes run a job fast (when the job’s references hit in the cache) and sometimes slow (when the job’s references miss in the cache). Therefore, designers of hard real-time systems make the time a job takes as predictable as possible, either by turning off performance-enhancing techniques (e.g., caches) or by assuming the worst case performance. Typically, designers turn off interrupts and poll devices so that they can carefully control when to interact with a device. These techniques combined increase the likelihood that the designer can estimate when jobs will arrive and for how long they will run. Once the amount of resources and time required for each job are estimated, the designer of a hard real-time system can compute the schedule for executing all jobs.
For other systems, such as a digital music system, missing a deadline occasionally might be just a minor annoyance; such systems can use a soft real-time scheduler. A soft real-time scheduler attempts to meet all deadlines but doesn’t guarantee it; it may miss a deadline. If, for example, multiple jobs arrive simultaneously, all have 1 second of work, and all have a deadline in 1 second, all jobs except one will miss their deadlines. The goal of a soft real-time scheduler is to avoid missing deadlines but to accept that it might happen when there is more work than there is time before the deadline to do the work.
One popular heuristic for avoiding missing deadlines is the earliest-deadline-first scheduler, which keeps the queue of jobs sorted by deadline. The dispatcher runs the first job on the queue, which is always the one with the closest deadline. Most students and faculty follow this policy: work first on the homework or paper that has the earliest deadline. This scheduling policy minimizes the total (summed) lateness of all the jobs.
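A minimal earliest-deadline-first dispatcher just keeps the ready queue ordered by deadline, as in this sketch; the jobs and their deadlines are invented:

```python
import heapq

# Earliest-deadline-first: always run the job whose deadline is closest.
ready = []                                   # heap of (deadline, job)
for deadline, job in [(30, "open valve"),
                      (10, "log sample"),
                      (20, "update display")]:
    heapq.heappush(ready, (deadline, job))

while ready:
    deadline, job = heapq.heappop(ready)
    print(f"run {job} (deadline t={deadline})")
```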
For soft real-time schedulers that have a given set of jobs that must execute at periodic intervals, we can develop scheduling algorithms instead of just heuristics. Systems with periodic jobs are quite common. For example, a digital video recorder must process a picture frame every 1/30th of a second to make the output look like a movie.
To develop a scheduler for such a system, the total amount of work to be done by the periodic jobs must be less than the capacity of the system. Consider a system with n periodic jobs i that occur with a period of Pi seconds and that each require Ci seconds. The load of such a system can be handled only if:

C1/P1 + C2/P2 + … + Cn/Pn ≤ 1
If the total amount of work exceeds the system’s capacity at any time, then the system will miss a deadline. If the total amount of work is less than the capacity, the system may still miss a deadline occasionally because for some short interval of time the total amount of work to be done is greater than the capacity of the system. For example, a periodic interrupt may arrive at the same time that a periodic task must run. Thus, the condition stated is a necessary condition but not a sufficient one.
A good algorithm for dynamically scheduling periodic jobs is the rate monotonic scheduler. In the design phase of the system, the designer assigns each job a priority that is proportional to the frequency of occurrence of that job. For example, a job that needs to run every 100 milliseconds receives a priority of 10, and a job that needs to run every 200 milliseconds receives a priority of 5. At run time, the scheduler always runs the highest-priority job, preempting a running job if necessary.
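The sketch below combines the utilization condition above with the rate-monotonic priority assignment; the periods and compute times are made up for illustration:

```python
# Rate-monotonic sketch: priority proportional to a job's frequency
# (1/period), after checking the necessary condition sum(Ci/Pi) <= 1.

jobs = {                     # name: (period Pi, compute time Ci), seconds
    "sensor": (0.100, 0.020),
    "control": (0.200, 0.050),
}

utilization = sum(c / p for p, c in jobs.values())
assert utilization <= 1, "the periodic load exceeds the system's capacity"

# Higher frequency means higher priority: 1/0.100 s gives priority 10,
# 1/0.200 s gives priority 5, matching the example in the text.
priority = {name: 1 / p for name, (p, c) in jobs.items()}

runnable = ["control", "sensor"]
print("dispatch:", max(runnable, key=priority.get))   # -> sensor
```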
Much work has been done on thread scheduling, but since processors are no longer a usual performance bottleneck, thread scheduling has become less important. As explained in Section 6.1, however, disk arm scheduling is important because the mechanical disk arm creates an I/O bottleneck. The typical goal of a disk arm scheduler is to optimize overall throughput as opposed to the delay for each individual request.
When a disk controller receives a batch of disk requests from the file system, it must decide the order in which to process these requests. At first glance, it might appear that first-come first-served is a fine choice for scheduling the requests, but unfortunately that choice is a bad one.
To see why, recall from Section 6.1 that if the controller moves the disk arm, it reduces the transfer rate of the disk because seeking from one track to another takes time. However, the time required to do a seek depends on how many tracks the arm must cross. A simple, but adequate, model is that a seek from one track to another track that is n tracks away takes n × t seconds, where t is roughly constant.
Consider a disk controller that is on track 0 and receives four requests that require seeks to the tracks 0 (the innermost track), 90, 5, and 100 (outermost track). If the disk controller performs the four requests in the order in which it received them (first-come first-served), then it will seek first to track 0, then to 90, back to 5, and then forward to 100, for a total seek latency of 270t:
| Request | Movement | Time |
| Seek 1 | 0 → 0 | 0t |
| Seek 2 | 0 → 90 | 90t |
| Seek 3 | 90 → 5 | 85t |
| Seek 4 | 5 → 100 | 95t |
| Total | | 270t |
A much better algorithm is to sort the requests by track number and process them in the sorted order. The total seek latency for that algorithm is 100t:
| Request | Movement | Time |
| Seek 1 | 0 → 0 | 0t |
| Seek 2 | 0 → 5 | 5t |
| Seek 3 | 5 → 90 | 85t |
| Seek 4 | 90 → 100 | 10t |
| Total | | 100t |
In practice, disk scheduling algorithms are more complex because new requests arrive while the disk controller is working on a set of requests. For example, if the disk controller is working on requests in the order of track number (0, 5, 90, and 100), it finishes 5, and receives a new request for track 1, which request should it perform next? It can go back and perform 1, or it can keep going and perform 90 and 100. The first choice is an algorithm that is called shortest seek first; the second choice is called the elevator algorithm, named after the algorithm that many elevators execute to transport people from floor to floor in buildings. With shortest-seek-first, the total seek time is 108t:
| Request | Movement | Time |
| Seek 1 | 0 → 0 | 0t |
| Seek 2 | 0 → 5 | 5t |
| Seek 3 | 5 → 1 | 4t |
| Seek 4 | 1 → 90 | 89t |
| Seek 5 | 90 → 100 | 10t |
| Total | | 108t |
With the elevator algorithm, the total seek latency is 199t:
| Request | Movement | Time |
| Seek 1 | 0 → 0 | 0t |
| Seek 2 | 0 → 5 | 5t |
| Seek 3 | 5 → 90 | 85t |
| Seek 4 | 90 → 100 | 10t |
| Seek 5 | 100 → 1 | 99t |
| Total | | 199t |
Many disk controllers use a combination of the shortest-seek-first algorithm and the elevator algorithm. When processing requests, for a while they use shortest-seek-first to choose requests, minimizing seek time, but then they switch to the elevator algorithm to avoid starving requests for more distant tracks. For example, if the controller performs the request for track 1 first, starts seeking in the direction of 90, but at track 5 another request for track 1 comes in, then shortest-seek-first would go back to track 1. Since this sequence of events may repeat forever, the disk controller may never serve the requests for tracks 90 and 100. By bounding the time that disk controllers perform shortest-seek-first and then switching to the elevator algorithm, requests for the distant tracks will also be served. This method is fine for disk systems, since the primary objective is to maximize total throughput, and thus delaying one request over another is acceptable. In a building, however, people do not want to have long delays, and therefore for buildings the elevator algorithm is better.
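The totals in the four tables above follow directly from the n × t seek model; this small sketch recomputes the shortest-seek-first and elevator cases, taking the service orders from the example as given:

```python
# Recompute total seek time under the model that a seek of n tracks
# costs n * t.  The request for track 1 arrives after track 5 is served,
# which is what produces the two different service orders below.

t = 1   # seek cost per track, in arbitrary units

def total_seek(start, order):
    """Total seek time for serving tracks in the given order."""
    cost, position = 0, start
    for track in order:
        cost += abs(track - position) * t
        position = track
    return cost

print(total_seek(0, [0, 5, 1, 90, 100]))   # shortest-seek-first: 108
print(total_seek(0, [0, 5, 90, 100, 1]))   # elevator algorithm:  199
```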
6.1 Suppose a processor has a clock rate of 100 megahertz. The time required to retrieve a word from the cache is 1 nanosecond, and the time required to retrieve a word not in the cache is 101 nanoseconds.
6.1a Determine the hit rate needed such that the average memory latency equals the processor cycle time.
1988–1–4a
6.1b Keeping the same memory devices but considering processors with a higher clock rate, what is the maximum useful clock rate such that the average memory latency equals the processor cycle time, and to what hit rate does it correspond?
1988–1–4b
6.2 A particular program uses 100 data objects, each 10^5 bytes long. The objects are contiguously allocated in a two-level memory system using the LRU page replacement policy with a fast memory of 10^6 bytes and a page size of 10^3 bytes. The program always makes 1,000 accesses to randomly selected bytes in one object, then moves on to another randomly selected object (with probability 0.01 it could be the same object), makes 1,000 accesses to randomly selected bytes there, and so on.
6.2a Ignoring any memory accesses that might be needed for fetching instructions, if the program runs long enough to reach an equilibrium state, what will the hit ratio be?
1987–1–5a
6.2b Will the hit ratio go up or down if the page size is changed from 10^3 words to 10^4 words, with all other memory parameters unchanged?
1987–1–5b
6.3 OutTel corporation has been delivering j786 microprocessors to the computer industry for some time, and Metoo systems has decided to get into the act by building a microprocessor called the “clone7861”, which differs from the j786 by providing twice as many processor registers. Metoo has simulated many programs and concluded that this one change reduces the number of loads and stores to memory by an average of 30%, and thus should improve performance, assuming of course that all programs—including its popular microkernel operating system—are recompiled to take advantage of the extra registers. Why might Metoo find the performance improvement to be less than their simulations predict? If there is more than one reason, which one is likely to reduce performance the most?
1994–1–6
6.4 Mike R. Kernel is designing the OutTel P97 computer system, which currently has one page table in hardware. The first tests with this design show excellent performance with one application, but with multiple applications, performance is awful. Suggest three design changes to Mike’s system that would improve performance with multiple applications, and explain your choices briefly. You cannot change processor speed, but any other aspect of the system is fair game.
1996–1–3
6.5 Ben Bitdiddle gets really excited about remote procedure call and implements an RPC package himself. His implementation starts a new service thread for each arriving request. The thread performs the operation specified in the request, sends a reply, and terminates. After measuring the RPC performance, Ben decides that it needs some improvement, so Ben comes up with a brute-force solution: he buys a much faster network. The transit time of the new network is half as large as it was before. Ben measures the performance of small RPCs (meaning that each RPC message contains only a few bytes of data) on the new network. To his surprise, the performance is barely improved. What might be the reason that his RPCs are not twice as fast?
1995–1–5c
6.6 Why might increasing the page size of a virtual memory system increase performance? Why might increasing the page size of a virtual memory system decrease performance?
1993–2–4a
6.7 Ben Bitdiddle and Louis Reasoner are examining a 3.5-inch magnetic disk that spins at 7,500 RPM, with an average seek time of 6.5 milliseconds and a data transfer rate of 10 megabytes per second. Sectors contain 512 bytes of user data.
6.7a On average, how long does it take to read a block of eight contiguous sectors when the starting sector is chosen at random?
6.7b Suppose that the operating system maintains a one-megabyte cache in RAM to hold disk sectors. The latency of this cache is 25 nanoseconds, and for block transfers the data transfer rate from the cache to a different location in RAM is 160 megabytes per second. Explain how these two specifications can simultaneously be true.
6.7c Give a formula that tells the expected time to read 100 randomly chosen disk sectors, assuming that the hit ratio of the disk block cache is h.
6.7d Ben’s workstation has 256 megabytes of RAM. To increase the cache hit ratio, Ben reconfigures the disk sector cache to be much larger than one megabyte. To his surprise he discovers that many of his applications now run slower rather than faster. What has Ben probably overlooked?
6.7e Louis has disassembled the disk unit to see how it works. Remembering that the centrifuge in the biology lab runs at 36,000 RPM, he has come up with a bright idea on how to reduce the rotational latency of the disk. He suggests speeding it up to 96,000 RPM. He calculates that the rotation time will now be 625 microseconds. Ben says this idea is crazy. Explain Ben’s concern.
1994–3–1
6.8 Ben Bitdiddle has proposed the simple neat and robust file system (SNARFS).* Ben’s system has no on-disk data structures other than the disk blocks themselves, which are self-describing. Each 4-kilobyte disk block starts with the following 24 bytes of information:
fid (File-ID): a 64-bit number that uniquely defines a file. A fid of zero implies that the disk block is free.
sn (Sequence Number): a 64-bit number that identifies which block of a file this disk block contains.
In addition, the first block of a file contains the file name (string), version number, and the fid of its parent directory. The rest of the first block is filled with data. Setting the directory fid to zero marks the entire file free.
Directories are just files. Each directory should contain only the fid of its parent directory. However, as a “hint” directories may also include a table giving the mapping from name to fid and the mapping from fid to blocks for some of the files in the directory.
To allow fast access, three in-memory (virtual memory) structures are created each time the system is booted:
MAP: an in-memory hash table that associates a (fid, sn) pair with the disk block containing that block of that file
FREE: a free list that represents all of the free blocks on disk in a compact manner
RECYCLE: a list of blocks that are available for reuse but have not yet been written with a fid of 0
6.8a Each read or write of a disk block results in one disk I/O. What is the minimum number of disk I/Os required in SNARFS to create a new file containing 2 kilobytes of data in an existing directory? If the system crashes (i.e., the contents of virtual memory are lost) after these I/Os are completed, the file should be present in the appropriate directory after recovery.
6.8b Ben argues that the in-memory structures can easily be rebuilt after a crash. Explain what actions are required to rebuild MAP, FREE, and RECYCLE at boot time.
1995–3–4a…c
6.9 Ben Bitdiddle has written a “style checker” intended to uncover writing problems in technical papers. The program picks up one sentence at a time, computes intensely for a while to parse it into nouns, verbs, and the like, and then looks up the resulting pattern in an enormous database of linguistic rules. The database was so large that it was necessary to place it on a remote service and do the lookup with a remote procedure call.
Ben is distressed to find that the RPCs have such a long latency that his style checker runs much more slowly than he hoped. He wonders if adding multiple threads to the client could speed up his program.
6.9a Ben’s checker is running on a single-processor workstation. Explain how multiple client threads could reduce the time to analyze a technical paper.
6.9b Ben implements a multithreaded style checker and runs a series of experiments with various numbers of threads. He finds that performance does indeed improve when he adds a second thread and again when he adds a third. But he finds that as he adds more and more threads the additional performance improvement diminishes, and finally adding more threads leads to reduced performance. Give an explanation for this behavior.
6.9c Suggest a way of improving the style checker’s performance without introducing threads. (Ben is allowed to change only the client.)
1994–1–4a…c
6.10 Threads in a new multithreaded Web browser periodically query a nearby World Wide Web server to retrieve documents. On average, a browser’s thread performs a query every N instructions. Each request to the server incurs an average round-trip time of T milliseconds before the answer returns.
6.10a For N = 2,000 instructions and T = 1 millisecond, what is the smallest number of such threads that would be required to keep a single 100 million instructions per second (MIPS) processor 100% busy? Assume that the context switch between threads is instantaneous and that the scheduler is optimal.
6.10b But context switches are not instantaneous. Assume that a context switch takes C instructions to perform. Recompute the answer to 6.10a for C = 500 instructions.
6.10c What property of the application threads might cause the answers of parts 6.10a and 6.10b to be incorrect? That is, why might more threads be required to keep the processor running the browser busy?
6.10d What property of the actual computer system might make the answers of 6.10a and 6.10b gross overestimates?
1995–1–4a…d
6.11 What are the advantages of using the clock algorithm as compared with implementing LRU directly?
A. Only a single bit per object or page is required.
B. Clock is more efficient to execute.
C. The first object or page to be purged is the most recently used one.
2001–1–4
6.12 Louis Reasoner found the mention of prepaging systems in Section 6.2.9 to be so intriguing that he has devised a version of OPT that uses prepaging. Here is a description of Louis’s prepage-OPT:
Knowing the reference string, create a total ordering of the pages in which each page is in the order in which the application will next make reference to it. Then, prepage the front of the stack into the primary memory.
After each page reference, rearrange the ordering so that every page is again in the order in which the application will next make reference to it. Thus, in contrast with LRU, which maintains an ordering since most recent use, prepage-OPT maintains an ordering of next use.
To do this rearrangement requires moving exactly one page, the one that was just touched, down in the ordering to the depth d where it will next be used. All of the pages that were above depth d move up one position. A page that will never be used again is assigned a depth of infinity and moves to the bottom of the stack. This rearrangement scheme ensures that the first page of the ordering is always the page that will be used next.
If d > m (where m is the size of the primary memory device), the operation of the second bullet will result in a page being moved from the secondary memory to the primary memory. Since the reference string has not yet demanded this page, this movement anticipates a future need, another example of prepaging.
6.12a Is prepage-OPT a stack algorithm? Why or why not?
6.12b For the reference string in the example of Table 6.3, develop a version of that table (or of Table 6.8 if that is more appropriate) that shows what page movements occur with prepage-OPT for each memory size. Assume that the first step of the run is to preload the primary memory with pages from the front of the ordering.
6.12c Is prepage-OPT better or worse than demand-OPT?
2006–0–1
Additional exercises relating to Chapter 6 can be found in the problem sets beginning on page 425.
* The textbook by Jain is an excellent source to learn about queuing theory and how to reason about performance in computer systems [Suggestions for Further Reading 1.1.2].
* T. Kilburn, D.B.J. Edwards, M.J. Lanigan, and F.H. Sumner. One-level storage system. IRE Transactions on Electronic Computers, EC-11, 2 (April 1962), pages 223–235.
Part II of this textbook continues a main theme of Part I—enforcing modularity—by introducing still stronger forms of modularity. Part I introduces the client/service model and virtualization, both of which help prevent accidental errors in one module from propagating to another. Part II introduces stronger forms of modularity that can help protect against component and system failures, as well as malicious attacks. Part II explores communication networks, constructing reliable systems from unreliable components, creating all-or-nothing and before-or-after transactions, and implementing security. In doing so, Part II also introduces additional design principles to guide a designer who needs to build computer systems that have stronger modularity. Following is a brief summary of the topics of those [on-line] chapters and supplementary materials. In addition, the Table of Contents for Part I lists the sections of the Part II chapters.
Chapter 7 [on-line]: Networks. By running clients and services on different computers that are connected by a network, one can build computer systems that exploit geographic separation to tolerate failures and construct systems that can enable information sharing across geographic distances. This chapter approaches the network as a case study of a system and digs deeply into how networks are organized internally and how they work. After a discussion that offers insight into why networks are built the way they are, it introduces a three-layer model, followed by a major section on each layer. A discussion of congestion control helps bring together the complete picture of interaction among the layers. The chapter ends with a short collection of war stories about network design flaws.
Chapter 8 [on-line]: Fault tolerance. This chapter introduces the basic techniques to build computer systems that, despite component failures, continue to provide service. It offers a systematic development of design principles and techniques for creating reliable systems from unreliable components, based on modularity and on generalization of some of the techniques used in the design of networks. The chapter ends with a case study of fault tolerance in memory systems and a set of war stories about fault tolerant systems that failed to be fault tolerant. This chapter is an unusual feature for an introductory text—this material, if it appears at all in a curriculum, is usually left to graduate elective courses—yet some degree of fault tolerance is a requirement for almost all computer systems.
Chapter 9 [on-line]: Atomicity. This chapter deals with the problem of making flawless updates to data in the presence of concurrent threads and despite system failures. It expands on concepts introduced in Chapter 5, taking a cross-cutting approach to atomicity—making actions atomic with respect to failures and also with respect to concurrent actions—that recognizes that atomicity is a form of modularity that plays a fundamental role in operating systems, database management, and processor design. The chapter begins by laying the groundwork for intuition about how a designer achieves atomicity, and then it introduces an easy-to-understand atomicity scheme. This basis sets the stage for straightforward explanations of instruction renaming, transactional memory, logs, and two-phase locking. Once an intuition is established about how to systematically achieve atomicity, the chapter goes on to show how database systems use logs to create all-or-nothing actions and automatic lock management to ensure before-or-after atomicity of concurrent actions. Finally, the chapter explores methods of obtaining agreement among geographically separated workers about whether or not to commit an atomic action. The chapter ends with case studies of atomicity in processor design and management of disk storage.
Chapter 10 [on-line]: Consistency. This chapter discusses a variety of requirements that show up when data is replicated for performance, availability, or durability: cache coherence, replica management for extended durability, and reconciliation of usually disconnected databases (e.g., “hotsync” of a personal digital assistant or cell phone with a desktop computer). The chapter introduces the reader to the requirements and the basic mechanisms used to meet those requirements. Sometimes these topics are identified with the label “distributed systems”.
Chapter 11 [on-line]: Security. Earlier chapters gradually introduced more powerful and far-reaching methods of enforcing modularity. This chapter cranks up the enforcement level to maximum strength by introducing the techniques of ensuring that modularity is enforced even in the face of adversaries who behave malevolently. It starts with design principles and a security model, and it then applies that model both to enforcement of internal modular boundaries (traditionally called “protection”) and to network security. An advanced topics section explains cryptographic techniques, which are the basis for most network security. A case study of the Secure Socket Layer (SSL) protocol and a set of war stories of protection system failures illustrate the range and subtlety of considerations involved in achieving security.
Suggestions for further reading. The suggested reading list in Part II is, apart from updates, the same as the one in this book.
Problem sets. The Part II collection of problem sets includes both the Part I problem sets and many additional problem sets for the Part II chapters.
Glossary. The on-line Glossary is identical to the one in this book. In addition to its primary purpose of supporting this textbook, the on-line Glossary also can serve as a reference source that workers in other specialties may find useful in coordinating their terminology with that of the field of systems.
Comprehensive Index. The on-line Index of Concepts provides page numbers for both Part I and Part II in a single alphabetic list.
A binary classification trade-off arises when we wish to classify a set of things into two categories (call them In and Out), but we do not have a direct way of doing the classifying. On the other hand, there is a proxy for those things that is relatively easy to classify. The problem is that the proxy is only approximate. Because it is only approximate, there are four classification outcomes:
True positive: The proxy classifies things as In that should be In.
True negative: The proxy classifies things as Out that should be Out.
False negative: The proxy classifies things as Out that should be In.
False positive: The proxy classifies things as In that should be Out.
The trade-off is that it may be possible to reduce the frequency of one of the false outcomes by adjusting some parameter of the proxy, but that adjustment will probably increase the frequency of the other false outcome*.
A common example is an e-mail spam filter, which is a proxy for the division between wanted e-mail and spam. The filter correctly classifies e-mail most of the time, but it occasionally misclassifies a wanted message as spam, with the undesirable outcome that you may never see that message. It may also misclassify some spam as wanted e-mail, with the undesirable outcome that the spam clutters up your mailbox. The trade-off appears when someone tries to adjust the spam filter. If the filter becomes more aggressive, more wanted e-mail is likely to end up misclassified as spam. If the spam filter becomes less aggressive, more spam is likely to end up in your mailbox.
Reducing both undesirable outcomes simultaneously usually requires discovering a better proxy, but a better one may be hard to find or may not exist at all.
Representations: One can conveniently represent a binary classification trade-off with a 2 × 2 matrix such as the one on the next page by answering two questions: (1) What are the real categories? and (2) What are the proxy categories? The example describes a smoke detector. The real categories are {fire, no fire}. The proxy categories are {smoke detector signals, smoke detector is quiet}. A too-sensitive smoke detector may signal more false alarms, but an insensitive one may miss more real fires. When someone replaces the labels with numbers of actual events, this representation is called a confusion matrix.
A Venn diagram, such as the one below, can be another useful representation of a binary classification trade-off. Take, for example, document retrieval (e.g., a Google search). The real categories are wanted and unwanted documents. The proxy is a query, for which the categories are that the query matches or the query misses.
Measures: Sometimes one can identify the true categorizations and compare them with the proxy classifications. When that is possible, it can be useful to calculate ratios that measure proxy quality. Unfortunately, there are too many possible ratios. The confusion matrix contains four numbers, which may be used singly or added together as either a numerator or a denominator in 14 ways, so it is possible to calculate 14 × 13 = 182 different ratios. Not all of these ratios are interesting, but one can usually find at least one ratio among the 182 that seems to support his or her position in a debate.
Nine of these ratios are popular enough to have names, although three of the nine are just complements of other named ratios. The information retrieval community uses one set of labels for these ratios, whereas the medical and bioinformatics communities use another, with other communities developing their own nomenclature. As will be seen, all of the labels can be confusing.
Suppose that there is a population of In + Out = N items and that we have run the classifier and counted the number of true and false positives and negatives. Here are the nine ratios (a small computational sketch follows the list):
1. Prevalence: The fraction of the population that is In.
2. Efficiency, Accuracy, or Hit Rate: The fraction of the population the proxy classifies correctly.
3. Precision (information retrieval) or Positive Predictive Value (medical): The fraction of things that the proxy classifies as In that are actually In.
4. Recall (information retrieval), Sensitivity (medical), or True acceptance rate (biometrics): The fraction of things in the population that are In that the proxy classifies as In.
5. Specificity (medical) or True rejection rate (biometrics): The fraction of things in the population that are Out that the proxy classifies as Out.
6. Negative Predictive Value: The fraction of things that the proxy classifies as Out that are actually Out.
7. Misclassification Rate or Miss Rate: The fraction of the population the proxy classifies wrong.
8. False Acceptance Rate: The fraction of Out items that are falsely classified as In.
9. False Rejection Rate: The fraction of In items that are falsely classified as Out.
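To make the definitions concrete, here is a sketch that computes the nine ratios from the four cells of a confusion matrix; the counts are invented for illustration:

```python
# The four cells of a hypothetical confusion matrix.
tp, tn, fp, fn = 90, 880, 20, 10      # true/false positives/negatives
n = tp + tn + fp + fn                 # population size, In + Out = N

prevalence = (tp + fn) / n            # fraction of the population In
accuracy = (tp + tn) / n              # efficiency, or hit rate
precision = tp / (tp + fp)            # positive predictive value
recall = tp / (tp + fn)               # sensitivity, true acceptance rate
specificity = tn / (tn + fp)          # true rejection rate
npv = tn / (tn + fn)                  # negative predictive value
miss_rate = (fp + fn) / n             # misclassification rate
far = fp / (fp + tn)                  # false acceptance rate
frr = fn / (fn + tp)                  # false rejection rate

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.3f}")
```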
1 Systems
1.1 Wonderful Books About Systems
1.2 Really Good Books About Systems
1.3 Good Books on Related Subjects Deserving Space on the Systems Bookshelf
1.4 Ways of Thinking About Systems
1.5 Wisdom About System Design
1.6 Changing Technology and its Impact on Systems
1.7 Dramatic Visions
1.8 Sweeping New Looks
1.9 Keeping Big Systems Under Control
2 Elements of Computer System Organization
3 The Design of Naming Schemes
4 Enforcing Modularity with Clients and Services
5 Enforcing Modularity with Virtualization
5.1 Kernels
5.2 Type Extension as a Modularity Enforcement Tool
5.3 Virtual Processors: Threads
5.4 Virtual Memory
5.5 Coordination
5.6 Virtualization
6 Performance
7 The Network as a System and as a System Component
8 Fault Tolerance: Reliable Systems from Unreliable Components
9 Atomicity: All-or-Nothing and Before-or-After
10 Consistency and Durable Storage
11 Information Security
11.1 Privacy
11.2 Protection Architectures
11.3 Certification, Trusted Computer Systems, and Security Kernels
11.4 Authentication
11.5 Cryptographic Techniques
11.6 Adversaries (The Dark Side)
The hardware technology that underlies computer systems has improved so rapidly and continuously for more than four decades that the ground rules for system design are constantly subject to change. It takes many years for knowledge and experience to be compiled, digested, and presented in the form of a book, so books about computer systems often seem dated or obsolete by the time they appear in print. Even though some underlying principles are unchanging, the rapid obsolescence of details acts to discourage prospective book authors, and as a result some important ideas are never documented in books. For this reason, an essential part of the study of computer systems is found in current—and, frequently, older—technical papers, professional journal articles, research reports, and occasional, unpublished memoranda that circulate among active workers in the field.
Despite that caveat, there are a few books, relatively recent additions to the literature in computer systems, that are worth having on the shelf. Until the mid-1980s, the books that existed were for the most part commissioned by textbook publishers to fill a market, and they tended to emphasize the mechanical aspects of systems rather than insight into their design. Starting around 1985, however, several very good books started to appear, when professional system designers became inspired to capture their insights. The appearance of these books also suggests that the concepts involved in computer system design are finally beginning to stabilize a bit. (Or it may just be that computer system technology is beginning to shorten the latencies involved in book publishing.)
The heart of the computer systems literature is found in published papers. Two of the best sources are Association for Computing Machinery (ACM) publications: the journal ACM Transactions on Computer Systems (TOCS) and the biennial series of conference proceedings, the ACM Symposium on Operating Systems Principles (SOSP). The best papers of each SOSP are published in a following issue of TOCS, and the rest—in recent years all—of the papers of each symposium appear in a special edition of Operating Systems Review, an ACM special interest group quarterly that publishes an extra issue in symposium years. Three other regular symposia are also worth following: the European Conference on Computer Systems (EuroSys), the USENIX Symposium on Operating Systems Design and Implementation (OSDI), and the USENIX Symposium on Networked Systems Design and Implementation (NSDI). These sources are not the only ones—worthwhile papers about computer systems appear in many other journals, conferences, and workshops. Complete copies of most of the papers listed here, including many of the older ones, can be found on the World Wide Web by an on-line search for an author’s last name and a few words of the paper title. Even papers whose primary listing requires a subscription are often posted elsewhere as open resources.
The following pages contain suggestions for further reading about computer systems, both papers and books. The list makes no pretensions of being complete. Instead, the suggestions have been selected from a vast literature to emphasize the best available thinking, best illustrations of problems, and most interesting case studies of computer systems. The readings have been reviewed for obsolescence, but it is often the case that a good idea is still best described by a paper from some time ago, where the idea was developed in a context that no longer seems very interesting. Sometimes that early context is much simpler than today’s systems, thus making it easier to see how the idea works. Often, an early author was the first on the scene, so it was necessary to describe things more completely than do modern authors who usually assume significant familiarity with the surroundings and with all of the predecessor systems. Thus the older readings included here provide a very useful complement to current works.
By its nature, the study of the engineering of computer systems overlaps with other areas of computer science, particularly computer architecture, programming languages, databases, information retrieval, security, and data communications. Each of those areas has an extensive literature of its own, and it is often not obvious where to draw the boundary lines. As a general rule, this reading list tries to provide only first-level guidance on where to start in those related areas.
One thing the reader must watch for is that the terminology of the computer systems field is not agreed upon, so the literature is often confusing even to the professional. In addition, the quality level of the literature is quite variable, ranging from the literate through the readable to the barely comprehensible. Although the selections here try to avoid that last category, the reader must still be prepared for some papers, however important in their content, that do not explain their subject as well as they could.
In the material that follows, each citation is accompanied by a comment suggesting why that paper is worth reading—its importance, interest, and relation to other readings. When a single paper serves more than one area of interest, cross-references appear rather than repeating the citation.
As mentioned above, a few wonderful and several really good books about computer systems have recently begun to appear. Here are the must-have items for the reference shelf of the computer systems designer. In addition to these books, the later groupings of readings by topic include other books, generally of narrower interest.
1.1.1 David A. Patterson and John L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufman, fourth edition, 2007. ISBN: 978-0-12-370490-0. 704 + various pages (paperback). The cover gives the authors’ names in the opposite order.
This book provides a spectacular tour-de-force that explores much of the design space of current computer architecture. One of the best features is that each area includes a discussion of misguided ideas and their pitfalls. Even though the subject matter gets very sophisticated, the book is always very readable. The book is opinionated (with a strong bias toward RISC architecture), but nevertheless this is a definitive work on computer organization from the system perspective.
1.1.2 Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, 1991. ISBN: 978-0-471-50336-1. 720 pages.
Much work on performance analysis of computer systems originates in academic settings and focuses on analysis that is mathematically tractable rather than on measurements that matter. This book is at the other end of the spectrum. It is written by someone with extensive industrial experience but an academic flair for explaining things. If you have a real performance analysis problem, it will tell you how to tackle it, how to avoid measuring the wrong thing, and how to sidestep other pitfalls.
1.1.3 Frederick P. Brooks Jr. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, 20th Anniversary edition, 1995. ISBN: 978-0-201-83595-3 (paperback). 336 pages.
Well-written and full of insights, this reading is by far the most significant one on the subject of controlling system development. This is where you learn why adding more staff to a project that is behind schedule will delay it further. Although a few of the chapters are now a bit dated, much of the material here is timeless. Trouble in system development is also timeless, as evidenced by continual reports of failures of large system projects. Most successful system designers have a copy of this book on their bookshelf, and some claim to reread it at least once a year. Most of the 1995 edition is identical to the first, 1975, edition; the newer edition adds Brooks’s “No Silver Bullet” essay (which is well worth reading) and some summarizing chapters.
1.1.4 Lawrence Lessig. Code and Other Laws of Cyberspace, Version 2.0. Basic Books, 2006. ISBN: 978-0-465-03914-2 (paperback). 432 pages; 978-0-465-03913-5 (paperback). 320 pages. Also available on-line at http://codev2.cc/
This book is an updated version of an explanation by a brilliant teacher of constitutional law of exactly how law, custom, market forces, and architecture together regulate things. In addition to providing a vocabulary to discuss many of the legal issues surrounding technology and the Internet, a central theme of this book is that because technology raises issues that were foreseen neither by law nor custom, the default is that it will be regulated entirely by market forces and architecture, neither of which is subject to the careful and deliberative thought that characterize the development of law and custom. If you have any interest in the effect of technology on intellectual property, privacy, or free speech, this book is required reading.
1.1.5 Jim [N.] Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Mateo, California, 1993 (Look for the low-bulk paper edition, which became available with the third printing in 1994). ISBN: 978-1-55860-190-1. 1,070 pages.
All aspects of fault tolerance, atomicity, coordination, recovery, rollback, logs, locks, transactions, and engineering trade-offs for performance are pulled together in this comprehensive book. This is the definitive work on transactions. Though not intended for beginners, given the high quality of its explanations, this complex material is surprisingly accessible. The glossary of terms is excellent; the historical notes are good as far as they go, but they are somewhat database-centric and should not be taken as the final word.
1.1.6 Alan F. Westin. Privacy and Freedom. Atheneum Press, 1967. 487 pages. (Out of print.)
If you have any interest in privacy, track down a copy of this book in a library or used-book store. It is the comprehensive treatment, by a constitutional lawyer, of what privacy is, why it matters, and its position in the U.S. legal framework.
1.1.7 Ross Anderson. Security Engineering: A Guide to Building Dependable Distributed Systems. John Wiley & Sons, second edition, 2008. ISBN: 978-0-470-06852-6. 1,040 pages.
This book is remarkable for the range of system security problems it considers, from taxi mileage recorders to nuclear command and control systems. It provides great depth on the mechanics, assuming that the reader already has a high-level picture. The book is sometimes quick in its explanations; the reader must be quite knowledgeable about systems. One of its strengths is that most of the discussions of how to do it are immediately followed by a section titled “What goes wrong”, exploring misimplementations, fallacies, and other modes of failure. The first edition is available on-line.
1.2.1 Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, third edition, 2008. ISBN: 978-0-13-600663-3 (hardcover). 952 pages.
This book provides a thorough tutorial introduction to the world of operating systems but with a tendency to emphasize the mechanics. Insight into why things are designed the way they are is there, but in many cases requires teasing out. Nevertheless, as a starting point, it is filled with street knowledge that is needed to get into the rest of the literature. It includes useful case studies of GNU/Linux, Windows Vista, and Symbian OS, an operating system for mobile phones.
1.2.2 Thomas P. Hughes. Rescuing Prometheus. Vintage reprint (paperback), originally published in 1998. ISBN: 978-0679739388. 372 pages.
A retired professor of history and sociology explains the stories behind the management of four large-scale, one-of-a-kind system projects: the Sage air defense system, the Atlas rocket, the Arpanet (predecessor of the Internet), and the design phase of the Big Dig (Boston Central Artery/Tunnel). The thesis of the book is that such projects, in addition to unique engineering, also had to develop a different kind of management style that can adapt continuously to change, is loosely coupled with distributed control, and can identify a consensus among many players.
1.2.3 Henry Petroski. Design Paradigms: Case Histories of Error and Judgment in Engineering. Cambridge University Press, 1994. ISBN: 978-0-521-46108-5 (hardcover), 978-0-521-46649-3 (paperback). 221 pages.
This remarkable book explores how the mindset of the designers (in the examples, civil engineers) allowed them to make what in retrospect were massive design errors. The failures analyzed range from the transportation of columns in Rome through the 1982 collapse of the walkway in the Kansas City Hyatt Regency Hotel, with a number of famous bridge collapses in between. Petroski analyzes particularly well how a failure of a scaled-up design often reveals that the original design worked correctly, but for a different reason than originally thought. There is no mention of computer systems in this book, but it contains many lessons for computer system designers.
1.2.4 Bruce Schneier. Applied Cryptography. John Wiley and Sons, second edition, 1996. ISBN: 978-0-471-12845-8 (hardcover), 978-0-471-11709-4 (paperback). 784 pages.
Here is everything you might want to know about cryptography and cryptographic protocols, including a well-balanced perspective on what works and what doesn’t. This book saves the need to read and sort through the thousand or so technical papers on the subject. Protocols, techniques, algorithms, real-world considerations, and source code can all be found here. In addition to being competent, it is also entertainingly written and very articulate. Be aware that a number of minor errors have been reported in this book; if you are implementing code, it would be a good idea to verify the details by consulting reading 1.3.13.
1.2.5 Radia Perlman. Interconnections, second edition: Bridges, Routers, Switches, and Internetworking Protocols. Addison-Wesley, 1999. ISBN: 978-0-201-63448-8. 560 pages.
This book presents everything you could possibly want to know about how the network layer actually works. The style is engagingly informal, but the content is absolutely first-class, and every possible variation is explored. The previous edition was simply titled Interconnections: Bridges and Routers.
1.2.6 Larry L. Peterson and Bruce S. Davie. Computer Networks: A Systems Approach. Morgan Kaufman, fourth edition, 2007. ISBN: 978-0-12-370548-8. 848 pages.
This book provides a systems perspective on computer networks. It strikes a good balance between explaining why networks are the way they are and discussing the important protocols in use. It follows a layering model but presents fundamental concepts independent of layering. In this way, the book provides a good discussion of timeless ideas as well as current embodiments of those ideas.
There are several other good books that many computer system professionals insist on having on their bookshelves. They don’t appear in one of the previous categories because their central focus is not on systems or because the purpose of the book is somewhat narrower.
1.3.1 Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. McGraw-Hill, second edition, 2001. 1,184 pages. ISBN: 978-0-07-297054-8 (hardcover); 978-0-262-53196-2 (M.I.T. Press paperback, not sold in U.S.A.)
1.3.2 Nancy A. Lynch. Distributed Algorithms. Morgan Kaufman, 1996. 872 pages. ISBN: 978-1-55860-348-6.
Occasionally, a system designer needs an algorithm. Cormen et al. and Lynch’s books are the place to find that algorithm, together with the analysis necessary to decide whether or not it is appropriate for the application. In a reading list on theory, these two books would almost certainly be in one of the highest categories, but for a systems list they are better identified as supplementary.
1.3.3 Douglas K. Smith and Robert C. Alexander. Fumbling the Future. William Morrow and Company, 1988. ISBN: 978-0-688-06959-9 (hardcover), 978-1-58348266-7 (iuniverse paperback reprint). 274 pages.
The history of computing is littered with companies that attempted to add general-purpose computer systems to an existing business—for examples, Ford, Philco, Zenith, RCA, General Electric, Honeywell, A. T. & T., and Xerox. None has succeeded, perhaps because when the going gets tough the option of walking away from this business is too attractive. This book documents how Xerox managed to snatch defeat from the jaws of victory by inventing the personal computer, then abandoning it.
1.3.4 Marshall Kirk McKusick, Keith Bostic, and Michael J. Karels. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, second edition, 1996. ISBN: 978-0-201-54979-9. 606 pages.
This book provides a complete picture of the design and implementation of the Berkeley version of the UNIX® operating system. It is well-written and full of detail. The 1989 first edition, describing 4.3BSD, is still useful.
1.3.5 Katie Hafner and John Markoff. Cyberpunk: Outlaws and Hackers on the Computer Frontier. Simon & Schuster (Touchstone), 1991, updated June 1995. ISBN: 978-0-671-68322-1 (hardcover), 978-0-684-81862-7 (paperback). 368 pages.
This book is a very readable, yet thorough, account of the scene at the ethical edges of cyberspace: the exploits of Kevin Mitnick, Hans Hubner, and Robert Tappan Morris. It serves as an example of a view from the media, but an unusually well-informed view.
1.3.6 Deborah G. Johnson and Helen Nissenbaum. Computers, Ethics & Social Values. Prentice-Hall, 1995. ISBN: 978-0-13-103110-4 (paperback). 714 pages.
A computer system designer is likely to consider reading a treatise on ethics to be a terribly boring way to spend the afternoon, and some of the papers in this extensive collection do match that stereotype. However, among the many scenarios, case studies, and other reprints in this volume are a large number of interesting and thoughtful papers about the human consequences of computer system design. This collection is a good place to acquire the basic readings concerning privacy, risks, computer abuse, and software ownership as well as professional ethics in computer system design.
1.3.7 Carliss Y. Baldwin and Kim B. Clark. Design Rules: Volume 1, The Power of Modularity. M.I.T. Press, 2000. ISBN: 978-0-262-02466-2. 471 pages.
This book focuses wholly on modularity (as used by the authors, this term merges modularity, abstraction, and hierarchy) and offers an interesting representation of interconnections to illustrate the power of modularity and of clean, abstract interfaces. The work uses these same concepts to interpret several decades of developments in the computer industry. The authors, from the Harvard Business School, develop a model of the several ways in which modularity operates by providing design options and making substitution easy. By the end of the book, most readers will have seen more than they wanted to know, but there are some ideas here that are worth at least a quick reading. (Despite the “Volume 1” in the title, there does not yet seem to be a Volume 2.)
1.3.8 Andrew S. Tanenbaum. Computer Networks. Prentice-Hall, fourth edition, 2003. ISBN: 978-0-13-066102-9. 813 pages.
This book provides a thorough tutorial introduction to the world of networks. Like the same author’s book on operating systems (see reading 1.2.1), this one also tends to emphasize the mechanics. But again it is a storehouse of up-to-date street knowledge, this time about computer communications, that is needed to get into (or perhaps avoid the need to consult) the rest of the literature. The book includes a selective and thoughtfully annotated bibliography on computer networks. An abbreviated version of this same material, sufficient for many readers, appears as a chapter of the operating systems book.
1.3.9 David L. Mills. Computer Network Time Synchronization: The Network Time Protocol. CRC Press/Taylor & Francis, 2006. ISBN: 978-0849358050. 286 pages.
A comprehensive but very readable explanation of the Network Time Protocol (NTP), an under-the-covers protocol of which most users are unaware: NTP coordinates multiple timekeepers and distributes current date and time information to both clients and servers.
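At the heart of NTP is a four-timestamp exchange between a client and a server. The sketch below shows only that core calculation, under the usual assumption of roughly symmetric network delay; Mills’s protocol adds clock filtering, selection among multiple servers, and gradual correction of the local clock:

    # Sketch of NTP's four-timestamp calculation (the full protocol adds
    # filtering, server selection, and gradual slewing of the local clock).
    #   t0: client clock when request sent   t1: server clock when request received
    #   t2: server clock when reply sent     t3: client clock when reply received
    def ntp_offset_and_delay(t0, t1, t2, t3):
        offset = ((t1 - t0) + (t2 - t3)) / 2   # estimated server-minus-client offset
        delay = (t3 - t0) - (t2 - t1)          # round trip, excluding server processing
        return offset, delay

    # Example: the client's clock runs 0.5 s behind the server's, and each
    # network leg takes 0.1 s.
    print(ntp_offset_and_delay(t0=10.0, t1=10.6, t2=10.7, t3=10.3))  # (0.5, 0.2)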
1.3.10 Robert G. Gallager. Principles of Digital Communication. Cambridge University Press, 2008. ISBN: 978-0-521-87907-1. 422 pages.
This intense textbook focuses on the theory that underlies the link layer of data communication networks. It is not for casual browsing or for those easily intimidated by mathematics, but it is an excellent reference source for analysis.
1.3.11 Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems: Design and Evaluation. A. K. Peters Ltd., third edition, 1998. ISBN: 978-1-56881-092-8. 927 pages.
This is probably the best comprehensive treatment of reliability that is available, with well-explained theory and reprints of several case studies from recent literature. Its only defect is a slight “academic” bias in that little judgment is expressed on alternative methods, and some examples are, without warning, of systems that were never really deployed. The first, 1982, edition, with the title The Theory and Practice of Reliable System Design, contains an almost completely different (and much older) set of case studies.
1.3.12 Bruce Schneier. Secrets & Lies: Digital Security in a Networked World. John Wiley & Sons, 2000. ISBN: 978-0-471-25311-2 (hardcover), 978-0-471-45380-2 (paperback). 432 pages.
This overview of security from a systems perspective provides much motivation, many good war stories (though without citations), and a high-level outline of how one achieves a secure system. Being an overview, it provides no specific guidance on the mechanics, other than to rely on people who know what they are doing. This is an excellent book, particularly for the manager who wants to go beyond the buzzwords and get an idea of what achieving computer system security involves.
1.3.13 A[lfred] J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997. ISBN: 978-0-8493-8523-0. 816 pages.
This book is exactly what its title claims: a very complete handbook on putting cryptography to work. It lacks the background and perspective of reading 1.2.4, and it is extremely technical, which makes parts of it inaccessible to less mathematically inclined readers. But its precise definitions and careful explanations make this by far the best reference book available on the subject.
1.3.14 Johannes A. Buchmann. Introduction to Cryptography. Springer, second edition, 2004. ISBN: 978-0-387-21156-5 (hardcover). 335 pages.
Buchmann provides a nice, concise introduction to number theory for cryptography.
1.3.15 Simson Garfinkel and Gene [Eugene H.] Spafford. Practical Unix and Internet Security. O’Reilly & Associates, Sebastopol, California, third edition, 2003. ISBN: 978-0-596-00323-4 (paperback). 986 pages.
This is a really comprehensive guide to how to run a network-attached UNIX system with some confidence that it is relatively safe against casual intruders. In addition to providing practical information for a system manager, it incidentally gives the reader quite a bit of insight into the style of thinking and design needed to provide security.
1.3.16 Simson Garfinkel. PGP: Pretty Good Privacy. O’Reilly & Associates, Sebastopol, California, 1995. ISBN: 978-1-56592-098-9 (paperback). 430 pages.
Nominally a user’s guide to the PGP encryption package developed by Phil Zimmermann, this book starts out with six very readable overview chapters on the subject of encryption, its history, and the political and licensing environment that surrounds encryption systems. Even the later chapters, which give details on how to use PGP, are filled with interesting tidbits and advice applicable to all encryption uses.
1.3.17 Warwick Ford and Michael S. Baum. Secure Electronic Commerce: Building the Infrastructure for Digital Signatures and Encryption. Prentice-Hall, second edition, 2000. ISBN: 978-0-13-027276-8. 640 pages.
Although the title implies more generality, this book is about public key infrastructure: certificate authorities, certificates, and their legal status in practice. The authors are a technologist (Ford) and a lawyer (Baum). The book provides very thorough coverage and is a good way to learn a lot about the subject. Because the status of this topic changes rapidly, however, it should be considered a snapshot rather than the latest word.
Quite a few books try to generalize the study of systems. They tend to be so abstract, however, that it is hard to see how they apply to anything, so none of them are listed here. Instead, here are five old but surprisingly relevant papers that illustrate ways to think about systems. The areas touched are allometry, aerodynamics, hierarchy, ecology, and economics.
1.4.1 J[ohn] B[urdon] S[anderson] Haldane (1892–1964). On being the right size. In Possible Worlds and Other Essays, pages 20–28. Harper and Brothers Publishers, 1928. Also published by Chatto & Windus, London, 1927, and recently reprinted in John Maynard Smith, editor, On Being the Right Size and Other Essays, Oxford University Press, 1985. ISBN: 0-19-286045-3 (paperback), pages 1–8.
This is the classic paper that explains why a mouse the size of an elephant would collapse if it tried to stand up. It provides lessons on how to think about incommensurate scaling in all kinds of systems.
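The arithmetic behind that collapse is worth writing down. Here is a sketch of the classic argument, idealizing an animal as a uniformly scaled solid of linear dimension L:

    % Haldane's scaling argument, idealized: weight grows with volume,
    % while the strength of a supporting bone grows with its cross-section.
    \begin{align*}
      \text{weight} \propto L^3, \qquad \text{bone strength} \propto L^2,
      \qquad \text{so} \qquad
      \frac{\text{load}}{\text{strength}} \propto \frac{L^3}{L^2} = L .
    \end{align*}
    % Scaling every linear dimension by a factor k multiplies skeletal
    % stress by k: the parts grow at incommensurate rates.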
1.4.2 Alexander Graham Bell (1847–1922). The tetrahedral principle in kite structure. National Geographic Magazine 14, 6 (June 1903), pages 219–251.
This classic paper demonstrates that arguments based on scale can be quite subtle. This paper—written at a time when physicists were still debating the theoretical possibility of building airplanes—describes the obvious scale argument against heavier-than-air craft and then demonstrates that one can increase the scale of an airfoil in different ways and that the obvious scale argument does not apply to all those ways. (This paper is a rare example of unreviewed vanity publication of an interesting engineering result. The National Geographic was—and still is—a Bell family publication.)
1.4.3 Herbert A. Simon (1916–2001). The architecture of complexity. Proceedings of the American Philosophical Society 106, 6 (December 1962), pages 467–482. Republished as Chapter 4, pages 84–118, of The Sciences of the Artificial, M.I.T. Press, Cambridge, Massachusetts, 1969. ISBN: 0-262-19051-6 (hardcover); 0-262-69023-3 (paperback).
This paper is a tour-de-force of how hierarchy is an organizing tool for complex systems. The examples are breathtaking in their range and scope—from watch-making and biology through political empires. The style of thinking shown in this paper suggests that it is not surprising that Simon later received the 1978 Nobel Prize in economics.
1.4.4 LaMont C[ook] Cole (1916–1978). Man’s effect on nature. The Explorer: Bulletin of the Cleveland Museum of Natural History 11, 3 (Fall 1969), pages 10–16.
This brief article looks at the Earth as an ecological system in which the actions of humans lead both to surprises and to propagation of effects. It describes a classic example of the propagation of effects: attempts to eliminate malaria in North Borneo led to an increase in the plague and roofs caving in.
1.4.5 Garrett [James] Hardin (1915–2003). The tragedy of the commons. Science 162, 3859 (December 13, 1968), pages 1243–1248. Extensions of “the tragedy of the commons”. Science 280, 5364 (May 1, 1998), pages 682–683.
This seminal paper explores a property of certain economic situations in which Adam Smith’s “invisible hand” works against everyone’s interest. It is interesting for its insight into how to predict things about otherwise hard-to-model systems. In revisiting the subject 30 years later, Hardin suggested that the adjective “unmanaged” should be placed in front of “commons”. Rightly or wrongly, the Internet is often described as a system to which the tragedy of the (unmanaged) commons applies.
Before reading anything else on this topic, one should absorb the book by Brooks, The Mythical Man-Month, reading 1.1.3, and the essay by Simon, “The architecture of complexity”, reading 1.4.3. The case studies on control of complexity in Section 1.9 also are filled with wisdom.
1.5.1 Richard P. Gabriel. Worse is better. Excerpt from LISP: good news, bad news, how to win BIG, AI Expert 6, 6 (June 1991), pages 33–35.
This paper explains why doing the thing expediently sometimes works out to be a better idea than doing the thing right.
1.5.2 Henry Petroski. Engineering: History and failure. American Scientist 80, 6 (November-December 1992), pages 523–526.
Petroski provides insight along the lines that one primary way that engineering makes progress is by making mistakes, studying them, and trying again. Petroski also visits this theme in two books, the most recent being reading 1.2.3.
1.5.3 Fernando J. Corbató. On building systems that will fail. Communications of the ACM 34, 9 (September 1991), pages 72–81. (Reprinted in the book by Johnson and Nissenbaum, reading 1.3.6.)
The central idea in this 1991 Turing Award Lecture is that all ambitious systems will have failures, but those that were designed with that expectation are more likely to eventually succeed.
1.5.4 Butler W. Lampson. Hints for computer system design. Proceedings of the Ninth ACM Symposium on Operating Systems Principles, in Operating Systems Review 17, 5 (October 1983), pages 33–48. Later republished, but with less satisfactory copy editing, in IEEE Software 1, 1 (January 1984), pages 11–28.
This encapsulation of insights is expressed as principles that seem to apply to more than one case. It is worth reading by all system designers.
1.5.5 Jon Bentley. The back of the envelope—programming pearls. Communications of the ACM 27, 3 (March 1984), pages 180–184.
One of the most important tools of a system designer is the ability to make rough but quick estimates of how big, how long, how fast, or how expensive a design will be. This brief note extols the concept and gives several examples.
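In that spirit, here is one such estimate; the round numbers below are illustrative, not measurements:

    # Back-of-the-envelope: how long to read an entire disk sequentially?
    disk_capacity_bytes = 1e12        # a 1-terabyte disk
    transfer_rate_bytes = 1e8         # roughly 100 megabytes/second

    seconds = disk_capacity_bytes / transfer_rate_bytes
    print(f"{seconds:.0f} seconds, about {seconds / 3600:.1f} hours")  # 10,000 s, ~2.8 h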
1.5.6 Jeffrey C. Mogul. Emergent (mis)behavior vs. complex software systems. Proceedings of the First European Conference on Computer Systems (EuroSys 2006, Leuven, Belgium), pages 293–304. ACM Press, 2006, ISBN: 1-59593-322-0. Also in Operating Systems Review 40, 4 (October 2006).
This paper explores in depth the concept of emergent properties described in Chapter 1, providing a nice collection of examples and tying together issues and problems that arise throughout computer and network system design. It also suggests a taxonomy of emergent properties, lays out suggestions for future research, and includes a comprehensive and useful bibliography.
1.5.7 Pamela Samuelson, editor. Intellectual property for an information age. Communications of the ACM 44, 2 (February 2001), pages 67–103.
This work is a special section comprising several papers about the challenges of intellectual property in a digital world. Each of the individual articles is written by a member of a new generation of specialists who understand both technology and law well enough to contribute some thoughtful insights to both domains.
1.5.8 Mark R. Chassin and Elise C. Becher. The wrong patient. Annals of Internal Medicine 136 (June 2002), pages 826–833.
This paper is a good example, first, of how complex systems fail for complex reasons and second, of the value of the “keep digging” principle. The case study presented here centers on a medical system failure in which the wrong patient was operated on. Rather than just identifying the most obvious reason, the case study concludes that there were a dozen or more opportunities in which the error that led to the failure should have been detected and corrected, but for various reasons all of those opportunities were missed.
1.5.9 P[hillip] J. Plauger. Chocolate. Embedded Systems Programming 7, 3 (March 1994), pages 81–84.
This paper provides a remarkable insight based on the observation that many failures in a bakery can be remedied by putting more chocolate into the mixture. The author manages, with only a modest stretch, to convert this observation into a more general technique of keeping recovery simple, so that it is likely to succeed.
1.6.1 Gordon E. Moore. Cramming more components onto integrated circuits. Electronics 38, 8 (April 19, 1965), pages 114–117. Reprinted in Proceedings of the IEEE 86, 1 (January 1998), pages 82–85.
This paper defined what we now call Moore’s law. The phenomena Moore describes have driven the rate of technology improvement for more than four decades. This paper articulates why and displays the first graph to plot Moore’s law, based on five data points.
1.6.2 John L. Hennessy and Norman P. Jouppi. Computer technology and architecture: An evolving interaction. IEEE Computer 24, 9 (September 1991), pages 19–29.
Although some of the technology examples are a bit out of date, the systems thinking and the paper’s insights remain relevant.
1.6.3 Ajanta Chakraborty and Mark R. Greenstreet. Efficient self-timed interfaces for crossing clock domains. Proceedings of the Ninth International Symposium on Asynchronous Circuits and Systems, IEEE Computer Society (May 2003), pages 78–88. ISBN: 0-7695-1898-2.
This paper addresses the challenge of having a fast, global clock on a chip by organizing the resources on a chip as a number of synchronous islands connected by asynchronous links. This design may pose problems for constructing perfect arbiters (see Section 5.2.8).
1.6.4 Anant Agarwal and Markus Levy. The KILL Rule for multicore. 44th ACM/IEEE Conference on Design Automation (June 2007), pages 750–753. ISBN: 978-1-59593-627-1.
This short paper looks ahead to multiprocessor chips that contain not just four or eight, but thousands of processors. It articulates a rule for power-efficient designs: Kill If Less than Linear. For example, the designer should increase the chip area devoted to a resource such as a cache only if for every 1% increase in area there is at least a 1% increase in chip performance. This rule focuses attention on the design elements that make the most effective use of chip area, and on the strength of back-of-the-envelope calculations it favors increasing processor count (which the paper assumes provides linear improvement) over other alternatives.
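As stated, the rule reduces to a comparison of marginal gains, as the following sketch shows; the function name and the numbers are illustrative, not from the paper:

    # Sketch of the KILL rule (Kill If Less than Linear): accept a design
    # change only if its relative performance gain at least matches its
    # relative increase in chip area.
    def passes_kill_rule(area_increase_pct, performance_gain_pct):
        return performance_gain_pct >= area_increase_pct

    # Illustrative numbers: enlarging a cache by 8% of chip area for a 3%
    # performance gain fails; adding a core at 10% area for 10% gain passes.
    print(passes_kill_rule(8, 3))     # False: kill it
    print(passes_kill_rule(10, 10))   # True: linear, keep it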
1.6.5 Stephen P. Walborn et al. Quantum erasure. American Scientist 91, 4 (July-August 2003), pages 336–343.
This paper was written by physicists and requires a prerequisite of undergraduate-level modern physics, but it manages to avoid getting into graduate-level quantum mechanics. The strength of the article is its clear identification of what is reasonably well understood and what is still a mystery about these phenomena. That identification seems to be of considerable value both to students of physics, who may be inspired to tackle the parts that are not understood, and to students of cryptography, because knowing what aspects of quantum cryptography are still mysteries may be important in deciding how much reliance to place on it.
Once in a while a paper comes along that either has a dramatic vision of what future systems might do or takes a sweeping new look at some aspect of systems design that had previously been considered to be settled. The ideas found in the papers listed in reading sections 1.7 and 1.8 often become part of the standard baggage of all future writers in the area, but the reprises rarely do justice to the originals, which are worth reading if only to see how the mind of a visionary (or revisionist) works.
1.7.1 Vannevar Bush. As we may think. Atlantic Monthly 176, 1 (July 1945), pages 101–108. Reprinted in Adele J. Goldberg, A History of Personal Workstations, Addison-Wesley, 1988, pages 237–247 and also in Irene Greif, ed., Computer-Supported Cooperative Work: A Book of Readings, Morgan Kaufman, 1988. ISBN: 0-934613-57-5.
Bush looked at the (mostly analog) computers of 1945 and foresaw that they would someday be used as information engines to augment the human intellect.
1.7.2 John G. Kemeny, with comments by Robert M. Fano and Gilbert W. King. A library for 2000 A.D. In Martin Greenberger, editor, Management and the Computer of the Future, M.I.T. Press and John Wiley, 1962, pages 134–178. (Out of print.)
It has taken 40 years for technology to advance far enough to make it possible to implement Kemeny’s vision of how the library might evolve when computers are used in its support. Unfortunately, the engineering that is required still hasn’t been done, so the vision has not yet been realized, but Google has stated a similar vision and is making progress in realizing it; see reading 3.2.4.
1.7.3 [Alan C. Kay, with the] Learning Research Group. Personal Dynamic Media. Xerox Palo Alto Research Center Systems Software Laboratory Technical Report SSL-76-1 (undated, circa March 1976).
Alan Kay was imagining laptop computers and how they might be used long before most people had figured out that desktop computers might be a good idea. He gave many inspiring talks on the subject, but he rarely paused long enough to write anything down. Fortunately, his colleagues captured some of his thoughts in this technical report. An edited version of this report, with some pictures accidentally omitted, appeared in a journal in the year following this technical report: Alan [C.] Kay and Adele Goldberg. Personal dynamic media. IEEE Computer 10, 3 (March 1977), pages 31–41. This paper was reprinted with omitted pictures restored in Adele J. Goldberg, A history of personal workstations, Addison-Wesley, 1988, pages 254–263. ISBN: 0-201-11259-0.
1.7.4 Doug[las] C. Engelbart. Augmenting Human Intellect: A Conceptual Framework. Research Report AFOSR-3223, Stanford Research Institute, Menlo Park, California, October 1962. Reprinted in Irene Greif, ed., Computer-Supported Cooperative Work: A Book of Readings, Morgan Kaufman, 1988. ISBN: 0-934613-57-5.
In the early 1960s Engelbart saw that computer systems would someday be useful in myriad ways as personal tools. Unfortunately, the technology of his time, multimillion-dollar mainframes, was far too expensive to make his vision practical. Today’s personal computers and engineering workstations have now incorporated many of his ideas.
1.7.5 F[ernando] J. Corbató and V[ictor] A. Vyssotsky. Introduction and overview of the Multics system. AFIPS 1965 Fall Joint Computer Conference 27, part I (1965), pages 185–196.
Working from a few primitive examples of time-sharing systems, Corbató and his associates escalated the vision to an all-encompassing computer utility. This paper is the first in a set of six about Multics in the same proceedings, pages 185–247.
1.8.1 Jack B. Dennis and Earl C. Van Horn. Programming semantics for multiprogrammed computations. Communications of the ACM 9, 3 (March 1966), pages 143–155.
This paper set the ground rules for thinking about concurrent activities, both the vocabulary and the semantics.
1.8.2 J. S. Liptay. Structural aspects of the System/360 model 85: II. The cache. IBM Systems Journal 7, 1 (1968), pages 15–21.
The idea of a cache, look-aside, or slave memory had been suggested independently by Francis Lee and Maurice Wilkes some time around 1963, but it was not until the advent of LSI technology that it became feasible to actually build one in hardware. As a result, no one had seriously explored the design space options until the designers of the IBM System/360 model 85 had to come up with a real implementation. Once this paper appeared, a cache became a requirement for most later computer architectures.
1.8.3 Claude E. Shannon. The communication theory of secrecy systems. Bell System Technical Journal 28, 4 (October 1949), pages 656–715.
This paper provides the underpinnings of the theory of cryptography, in terms of information theory.
1.8.4 Whitfield Diffie and Martin E. Hellman. Privacy and authentication: An introduction to cryptography. Proceedings of the IEEE 67, 3 (March 1979), pages 397–427.
This is the first technically competent paper on cryptography since Shannon in the unclassified literature, and it launched modern unclassified study. It includes a complete and scholarly bibliography.
1.8.5 Whitfield Diffie and Martin E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory IT-22, 6 (November 1976), pages 644–654.
Diffie and Hellman were the second inventors of public key cryptography (the first inventor, James H. Ellis, was working on classified projects for the British Government Communications Headquarters at the time, in 1970, and his work was not made public until 1997). This is the paper that introduced the idea to the unclassified world.
1.8.6 Charles T. Davies, Jr. Data processing spheres of control. IBM Systems Journal 17, 2 (1978), pages 179–198. Charles T. Davies, Jr. Recovery semantics for a DB/DC system. 1973 ACM National Conference 28 (August 1973), pages 136–141.
This pair of papers—vague but thought-provoking—gives a high-level discussion of “spheres of control”, a notion closely related to atomicity. Everyone who writes about transactions mentions that they found these two papers inspiring.
1.8.7 Butler W. Lampson and Howard Sturgis. Crash recovery in a distributed data storage system. Working paper, Xerox Palo Alto Research Center, November 1976 and April 1979. (Never published.)
Jim Gray called the 1976 version of this paper “an underground classic”. The 1979 version presents the first good definition of models of failure. Both describe algorithms for coordinating distributed updates; they are sufficiently different that both are worth reading.
1.8.8 Leonard Kleinrock. Communication Nets: Stochastic Message Flow and Delay. McGraw-Hill, 1964. Republished by Dover, 2007. ISBN: 0-486-45880-6. 224 pages.
1.8.9 Paul Baran, S. Boehm, and J. W. Smith. On Distributed Communications. A series of 11 memoranda of the RAND Corporation, Santa Monica, California, August 1964.
自从互联网普及以来,关于谁最先想到分组交换的讨论一直很多。看来,当时正在麻省理工学院撰写博士论文、研究如何更有效地使用有线网络的 Leonard Kleinrock,与正在研究可存活通信的 Paul Baran 及其兰德公司的同事,几乎同时独立提出了分组交换的想法;两人都在 1961 年撰写了描述各自想法的内部备忘录。然而,他们都没有真正使用过“分组交换”这个词;这个名称是国家物理实验室的 Donald Davies 在几年后创造的。
Since the growth in the Internet’s popularity, there has been considerable discussion about who first thought of packet switching. It appears that Leonard Kleinrock, working in 1961 on his M.I.T. Ph.D. thesis on more effective ways of using wired networks, and Paul Baran and his colleagues at Rand, working in 1961 on survivable communications, independently proposed the idea of packet switching at about the same time; both wrote internal memoranda in 1961 describing their ideas. Neither one actually used the words “packet switching”, however; that was left to Donald Davies of the National Physical Laboratory, who coined that label several years later.
1.8.10Lawrence G. Roberts 和 Barry D. Wessler。计算机网络开发以实现资源共享。AFIPS春季联合计算机会议 36(1970 年 5 月),第 543–549 页。
1.8.10 Lawrence G. Roberts and Barry D. Wessler. Computer network development to achieve resource sharing. AFIPS Spring Joint Computer Conference 36 (May 1970), pages 543–549.
这篇论文和在同一次会议上发表的其他四篇论文(第 543-597 页)首次公开描述了 ARPANET,这是第一个成功的分组交换网络和互联网的原型。两年后,AFIPS 春季联合计算机会议 40 (1972 年),第 243-298 页,发表了另外五篇密切相关的论文。关于阅读1.8.8和阅读1.8.9的优先权的讨论有些学术性;正是罗伯茨对 ARPANET 的赞助证明了分组交换的可行性。
This paper and four others presented at the same conference session (pages 543–597) represent the first public description of the ARPANET, the first successful packet-switching network and the prototype for the Internet. Two years later, AFIPS Spring Joint Computer Conference 40 (1972), pages 243–298, presented five additional, closely related papers. The discussion of priority concerning reading 1.8.8 and reading 1.8.9 is somewhat academic; it was Roberts’s sponsorship of the ARPANET that demonstrated the workability of packet switching.
1.8.11V[inton G.] Cerf 等人。延迟容忍网络架构。征求意见 RFC 4838,互联网工程任务组(2007 年 4 月)。
1.8.11 V[inton G.] Cerf et al. Delay-Tolerant Networking Architecture. Request for Comments RFC 4838, Internet Engineering Task Force (April 2007).
本文档描述了一种从星际互联网愿景发展而来的架构,星际互联网是一种用于星际距离的类似互联网的网络。本文档介绍了几个有趣的想法,并强调了人们在设计网络时没有意识到的一些假设。NASA 对延迟容忍网络的原型实现进行了首次成功测试。
This document describes an architecture that evolved from a vision for an Interplanetary Internet, an Internet-like network for interplanetary distances. This document introduces several interesting ideas and highlights some assumptions that people make in designing networks without realizing it. NASA performed its first successful tests of a prototype implementation of a delay-tolerant network.
1.8.12Jim Gray 等人。Terascale Sneakernet。使用廉价磁盘进行备份、存档和数据交换。Microsoft 技术报告 MSR-TR-02-54(2002 年 5 月)。http://arxiv.org/pdf/cs/0208011 。潜行者网络(sneakernet)是一个通用术语,指通过物理方式交付存储设备而不是通过电线发送数据。当数据量非常大以至于电子传输需要很长时间或成本太高,而第一个字节到达之前的延迟不那么重要时,潜行者网络就很有吸引力。早期的潜行者网络使用软盘交换程序和数据。最近,人们通过刻录 CD 并随身携带来交换数据。本文提出通过发送硬盘来构建潜行者网络,硬盘封装在小型低成本计算机(称为存储砖)中。这种方法允许人们在几天内通过邮件将数 TB 的数据传送到全球各地。由于包括计算机和操作系统,它可以最大限度地减少将数据传输到另一台计算机时出现的兼容性问题。
1.8.12 Jim Gray et al. Terascale Sneakernet. Using Inexpensive Disks for Backup, Archiving, and Data Exchange. Microsoft Technical Report MSR-TR-02-54 (May 2002). http://arxiv.org/pdf/cs/0208011. Sneakernet is a generic term for transporting data by physically delivering a storage device rather than sending it over a wire. Sneakernets are attractive when the data volume is so large that electronic transport would take a long time or be too expensive, and the latency until the first byte arrives is less important. Early sneakernets exchanged programs and data using floppy disks. More recently, people have exchanged data by burning CDs and carrying them. This paper proposes to build a sneakernet by sending hard disks encapsulated in a small, low-cost computer called a storage brick. This approach allows one to transfer terabytes of data across the planet by mail in a few days. By virtue of including a computer and operating system, it minimizes the compatibility problems that arise when transferring the data to another computer.
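The arithmetic behind the attraction is worth working out. With illustrative numbers (ours, not the paper's), suppose a package of ten 1-terabyte storage bricks is delivered overnight. Its effective bandwidth is

    \[ \frac{10 \times 10^{12}\ \text{bytes} \times 8\ \text{bits/byte}}{24 \times 3600\ \text{s}} \approx 9.3 \times 10^{8}\ \text{bits/s} \approx 926\ \text{Mbit/s}, \]

sustained for the entire day. Sending the same \(8 \times 10^{13}\) bits over a 100 Mbit/s link would take about \(8 \times 10^{5}\) seconds, roughly nine days. The price is latency: the first byte of the shipment arrives only when the package does.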
在具体主题下列出的其他几篇论文也提供了全面的新视角或改变了人们对系统的看法:Simon 的《复杂性的架构》,阅读1.4.3;Thompson 的《对信任的思考》,阅读11.3.3;Lampson 的《计算机系统设计的提示》,阅读1.5.4;以及 Creasy 的 VM/370 论文,阅读5.6.1。
Several other papers listed under specific topics also provide sweeping new looks or have changed the way people think about systems: Simon, The architecture of complexity, reading 1.4.3; Thompson, Reflections on trusting trust, reading 11.3.3; Lampson, Hints for computer system design, reading 1.5.4; and Creasy’s VM/370 paper, reading 5.6.1.
1.9.1F[ernando] J. Corbató 和 C[harles] T. Clingen。Multics 系统开发的管理视角。摘自 Peter Wegner 的《软件技术研究方向》,麻省理工学院出版社,马萨诸塞州剑桥,1979 年,第 139-158 页。ISBN:0-262-23096-8。
1.9.1 F[ernando] J. Corbató and C[harles] T. Clingen. A managerial view of the Multics system development. In Peter Wegner, Research Directions in Software Technology, M.I.T. Press, Cambridge, Massachusetts, 1979, pages 139–158. ISBN: 0-262-23096-8.
1.9.2W[illiam A.] Wulf、R[oy] Levin 和 C. Pierson。Hydra 操作系统开发概述。第五届 ACM 操作系统原理研讨会论文集,载于《操作系统评论》第 9、5期(1975 年 11 月),第 122-131 页。
1.9.2 W[illiam A.] Wulf, R[oy] Levin, and C. Pierson. Overview of the Hydra operating system development. Proceedings of the Fifth ACM Symposium on Operating Systems Principles, in Operating Systems Review 9, 5 (November 1975), pages 122–131.
1.9.3Thomas R. Horsley 和 William C. Lynch。Pilot:软件工程案例研究。第四届国际软件工程大会(1979 年 9 月),第 94-99 页。
1.9.3 Thomas R. Horsley and William C. Lynch. Pilot: A software engineering case study. Fourth International Conference on Software Engineering (September 1979), pages 94–99.
这三篇论文是对管理和开发大型系统所面临的挑战的早期描述。它们仍然具有现实意义且易于阅读,并提供了互补的见解。
These three papers are early descriptions of the challenges of managing and developing large systems. They are still relevant and easy to read, and provide complementary insights.
1.9.4Effy Oz。《当专业标准松懈时:CONFIRM 的失败及其教训》。《ACM 通讯》37,10(1994 年 10 月),第 30-36 页。
1.9.4 Effy Oz. When professional standards are lax: The CONFIRM failure and its lessons. Communications of the ACM 37, 10 (October 1994), pages 30–36.
CONFIRM 是一个航空公司/酒店/租车预订系统,尽管经过了四年的努力和超过 1 亿美元的投资,但最终还是未能面世。它是许多失控的计算机系统开发之一,最终被丢弃而未投入使用。每年都会有几次新闻报道类似规模的软件灾难。很难获得有关系统开发失败的确凿事实,因为没有人愿意承担责任,尤其是在诉讼悬而未决的情况下。本文缺乏事实,并且过于简单地建议只需更好的道德规范即可解决问题。(似乎道德和管理问题只是延迟了对不可避免的事情的认识。)尽管如此,它还是提供了一个清醒的视角,表明事情可能会变得多么糟糕。
CONFIRM is an airline/hotel/rental-car reservation system that never saw the light of day despite four years of work and an investment of more than $100M. It is one of many computer system developments that went out of control and finally were discarded without ever having been placed in service. One sees news reports of software disasters of similar magnitude a few times each year. It is difficult to obtain solid facts about system development failures because no one wants to accept the blame, especially when lawsuits are pending. This paper suffers from a shortage of facts and an overly simplistic recommendation that better ethics are all that is needed to solve the problem. (It seems likely that the ethics and management problems simply delayed recognition of the inevitable.) Nevertheless, it provides a sobering view of how badly things can go wrong.
1.9.5Nancy G. Leveson 和 Clark S. Turner。Therac-25 事故调查。Computer 26,7(1993 年 7 月),第 18-41 页。(重印于阅读1.3.6。)
1.9.5 Nancy G. Leveson and Clark S. Turner. An investigation of the Therac-25 accidents. Computer 26, 7 (July 1993), pages 18–41. (Reprinted in reading 1.3.6.)
这是另一个令人警醒的观点,表明事情可能会变得多么糟糕。在本案中,高能医疗设备的软件控制器设计不当;该设备投入使用,并导致致命伤害。本文设法深入探究问题的根源。不幸的是,此后又犯了类似的错误;例如,参见美国核管理委员会信息通知 2001-8s1(2001 年 6 月),其中描述了巴拿马的放射治疗过度暴露。
This is another sobering view of how badly things can go wrong. In this case, the software controller for a high-energy medical device was inadequately designed; the device was placed in service, and lethal injuries ensued. This paper manages to inquire quite deeply into the source of the problems. Unfortunately, similar mistakes have been made since; see, for example, United States Nuclear Regulatory Commission Information Notice 2001-8s1 (June 2001), which describes radiation therapy overexposures in Panama.
1.9.6乔·摩根斯坦。《城市危机:五十九层楼的危机》。《纽约客》71,14(1995 年 5 月 29 日),第 45-53 页。本文讨论了一位工程师在得知自己设计的摩天大楼有被飓风倒塌的危险时的反应。
1.9.6 Joe Morgenstern. City perils: The fifty-nine-story crisis. The New Yorker 71, 14 (May 29, 1995), pages 45–53. This article discusses how an engineer responded to the realization that a skyscraper he had designed was in danger of collapsing in a hurricane.
1.9.7Eric S. Raymond。大教堂和集市。《大教堂和集市:一位偶然的革命者对 Linux 和开源的思考》第 19-64 页。O'Reilly Media Inc.,2001 年。ISBN:978-0596001087,241 页。
1.9.7 Eric S. Raymond. The cathedral and the bazaar. In The Cathedral and The Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, pages 19–64. O’Reilly Media Inc., 2001. ISBN: 978-0596001087, 241 pages.
本书基于同名白皮书,该白皮书比较了两种软件开发风格:大教堂模式,主要由商业软件公司和一些开源项目使用,例如 BSD 操作系统;以及集市模式,以 GNU/Linux 操作系统的开发为例。该书认为,集市模式可以开发出更好的软件,因为集市的开放性和独立性允许任何人成为参与者,并查看系统中任何看起来有趣的内容:“只要有足够的眼光,所有错误都是浅显的”。
The book is based on a white paper of the same title that compares two styles of software development: the Cathedral model, which is used mostly by commercial software companies and some open-source projects such as the BSD operating system; and the Bazaar model, which is exemplified by development of the GNU/Linux operating system. The work argues that the Bazaar model leads to better software because the openness and independence of Bazaar allow anyone to become a participant and to look at anything in the system that seems of interest: “Given enough eyeballs, all bugs are shallow”.
1.9.8Philip M. Boffey。调查人员一致认为 1977 年纽约大停电是可以避免的。《科学》201,4360(1978 年 9 月 15 日),第 994-996 页。
1.9.8 Philip M. Boffey. Investigators agree N. Y. blackout of 1977 could have been avoided. Science 201, 4360 (September 15, 1978), pages 994–996.
这是一个令人着迷的描述,关于纽约联合爱迪生公司的发电和配电系统如何崩溃:两个原本可以容忍的故障接连发生,恢复机制没有按预期工作,手动恢复的尝试因系统的复杂性而受阻,最后事情失去了控制。
This is a fascinating description of how the electrical generation and distribution system of New York’s Consolidated Edison fell apart when two supposedly tolerable faults occurred in close succession, recovery mechanisms did not work as expected, attempts to recover manually got bogged down by the system’s complexity, and finally things cascaded out of control.
要了解有关内存和解释器的基本抽象的更多信息, Patterson 和 Hennessy 合著的《计算机体系结构》 (阅读 1.1.1 )一书是最好的资料之一。有关第三个基本抽象(通信链路)的更多信息,请参阅阅读第 7 节。
To learn more about the basic abstractions of memory and interpreters, the book Computer Architecture by Patterson and Hennessy (reading 1.1.1) is one of the best sources. Further information about the third basic abstraction, communication links, can be found in readings section 7.
2.1.1Bruce [G.] Lindsay。分布式数据库管理器的对象命名和目录管理。第二届国际分布式计算系统会议论文集,法国巴黎(1981 年 4 月),第 31-40 页。另请参阅 IBM 圣何塞研究实验室技术报告 RJ2914(1980 年 8 月)。17 页。本文是关于数据库系统中名称使用的教程,首先介绍一个高于平均水平的需求陈述,然后演示如何在 R* 分布式数据库管理系统中满足这些需求。
2.1.1 Bruce [G.] Lindsay. Object naming and catalog management for a distributed database manager. Proceedings of the Second International Conference on Distributed Computing Systems, Paris, France (April 1981), pages 31–40. Also IBM San Jose Research Laboratory Technical Report RJ2914 (August 1980). 17 pages. This paper, a tutorial treatment of names as used in database systems, begins with a better-than-average statement of requirements and then demonstrates how those requirements were met in the R* distributed database management system.
2.1.2Yogen K. Dalal 和 Robert S. Printis。48 位绝对 Internet 和以太网主机号。墨西哥墨西哥城第七届数据通信研讨会论文集(1981 年 10 月),第 240-245 页。另请参阅施乐办公产品部门技术报告 OPD-T8101(1981 年 7 月),14 页。
2.1.2 Yogen K. Dalal and Robert S. Printis. 48-bit absolute Internet and Ethernet host numbers. Proceedings of the Seventh Data Communications Symposium, Mexico City, Mexico (October 1981), pages 240–245. Also Xerox Office Products Division Technical Report OPD-T8101 (July 1981), 14 pages.
本文介绍了以太网局域网中硬件地址的处理方式。
This paper describes how hardware addresses are handled in the Ethernet local area network.
2.1.3西奥多·霍尔姆·纳尔逊。《文学机器》,第 87.1 版。《Xanadu 项目》,德克萨斯州圣安东尼奥,1987 年。ISBN:0-89347-056-2(平装本)。各种页码。
2.1.3 Theodor Holm Nelson. Literary Machines, Ed. 87.1. Project Xanadu, San Antonio, Texas, 1987. ISBN: 0-89347-056-2 (paperback). Various pagings.
Xanadu 项目是一个雄心勃勃的未来愿景,其中书籍将被以命名网络形式组织的信息所取代,这种形式如今被称为“超文本”。书籍在某种程度上是非线性的,是纳尔逊所提倡的一个原始例子。
Project Xanadu is an ambitious vision of a future in which books are replaced by information organized in the form of a naming network, in the form that today is called “hypertext”. The book, being somewhat non-linear, is a primitive example of what Nelson advocates.
以下阅读材料和 Marshall McKusick 等人的书籍阅读材料1.3.4是UNIX系统的绝佳资源,可用于跟进第 2.5 节中的案例研究。Tanenbaum 的操作系统书籍 [阅读材料 1.2.1 ]中可以找到有关其主要特性的良好、简洁的摘要,其中还介绍了 Linux。
The following readings and the book by Marshall McKusick et al., reading 1.3.4, are excellent sources on the UNIX system to follow up the case study in Section 2.5. A good, compact summary of its main features can be found in Tanenbaum’s operating systems book [reading 1.2.1], which also covers Linux.
2.2.1Dennis M. Ritchie 和 Ken [L.] Thompson。UNIX分时系统。Bell System Technical Journal 57,6,第 2 部分 (1978),第 1905-1930 页。
2.2.1 Dennis M. Ritchie and Ken [L.] Thompson. The UNIX time-sharing system. Bell System Technical Journal 57, 6, part 2 (1978), pages 1905–1930.
本文介绍了一种有影响力的操作系统,其目标非常低调,但经过精心选择且难以发现。该系统提供了分层目录结构,并成功地将命名与文件管理完全区分开来。本文的早期版本出现在ACM 通讯 17 , 7 (1974 年 7 月) 第 365-375 页,是在第四届 ACM 操作系统原理研讨会上发表的。UNIX系统在1973年至 1978 年间发展迅速,因此BSTJ版本虽然更难找到,但包含重要的补充,无论是在洞察力还是技术内容方面。
This paper describes an influential operating system with very low-key, but carefully chosen and hard-to-discover, objectives. The system provides a hierarchical catalog structure and succeeds in keeping naming completely distinct from file management. An earlier version of this paper appeared in the Communications of the ACM 17, 7 (July 1974), pages 365–375, after being presented at the Fourth ACM Symposium on Operating Systems Principles. The UNIX system evolved rapidly between 1973 and 1978, so the BSTJ version, though harder to find, contains significant additions, both in insight and in technical content.
2.2.2John Lions。Lions’ Commentary on UNIX 第 6 版(带源代码)。Peer-to-Peer Communications,1977 年。ISBN:978-1-57398-013-7,254 页。
2.2.2 John Lions. Lions’ Commentary on UNIX 6th Edition with Source Code. Peer-to-Peer Communications, 1977. ISBN: 978-1-57398-013-7, 254 pages.
本书包含UNIX版本 6 的源代码,并附有注释来解释其工作原理。尽管版本 6 已经过时,但本书仍然是从内部了解系统工作原理的绝佳起点,因为源代码和注释都很简短而简洁。几十年来,这本书一直是设计人员了解 UNIX 系统的地下文献的一部分,但现在它已向公众开放。
This book contains the source code for UNIX Version 6, with comments to explain how it works. Although Version 6 is old, the book remains an excellent starting point for understanding how the system works from the inside, because both the source code and the comments are short and succinct. For decades, this book was part of the underground literature from which designers learned about the UNIX system, but now it is available to the public.
几乎任何系统都有命名方案,许多有趣的命名方案都可以在描述大型系统的论文中找到。任何对命名感兴趣的读者都应该研究域名系统(阅读4.3 )和第 4.4 节的主题。
Almost any system has a naming plan, and many of the interesting naming plans can be found in papers that describe a larger system. Any reader interested in naming should study the Domain Name System, reading 4.3, and the topic of Section 4.4.
一些早期的资料仍然包含一些最容易理解的设计解释,这些设计直接在硬件中融入了高级命名功能。
Several early sources still contain some of the most accessible explanations of designs that incorporate advanced naming features directly in hardware.
3.1.1杰克·B·丹尼斯。多道程序计算机系统的分段和设计。《ACM 杂志》12,4(1965 年 10 月),第 589-602 页。
3.1.1 Jack B. Dennis. Segmentation and the design of multiprogrammed computer systems. Journal of the ACM 12, 4 (October 1965), pages 589–602.
这是概述在硬件架构中提供命名支持的优势的原始论文。
This is the original paper outlining the advantages of providing naming support in hardware architecture.
3.1.2R[obert] S. Fabry。基于能力的寻址。ACM通讯 17,7(1974 年 7 月),第 403-412 页。
3.1.2 R[obert] S. Fabry. Capability-based addressing. Communications of the ACM 17, 7 (July 1974), pages 403–412.
这是对功能的首次全面处理,它是一种为强制模块化而引入的机制,但实际上更像是一种命名功能。
This is the first comprehensive treatment of capabilities, a mechanism introduced to enforce modularity but actually more of a naming feature.
3.1.3Elliott I. Organick。计算机系统组织,B5700/B6700 系列。Academic Press,1973 年。ISBN:0-12-528250-8。132 页。
3.1.3 Elliott I. Organick. Computer System Organization, The B5700/B6700 Series. Academic Press, 1973. ISBN: 0-12-528250-8. 132 pages.
本书解释的 Burroughs 描述符系统显然是在微编程出现之前实际实现的硬件支持命名系统的唯一示例。
The Burroughs Descriptor system explained in this book is apparently the only example of a hardware-supported naming system actually implemented before the advent of microprogramming.
3.1.4Elliott I. Organick。Multics系统:结构检查。麻省理工学院出版社,马萨诸塞州剑桥,1972 年。ISBN:0-262-15012-3。392 页。
3.1.4 Elliott I. Organick. The Multics System: an Examination of its Structure. M.I.T. Press, Cambridge, Massachusetts, 1972. ISBN: 0-262-15012-3. 392 pages.
本书探讨了 Multics 的广泛命名机制的每一个细节和后果,包括寻址架构和文件系统。
This book explores every detail and ramification of the extensive naming mechanisms of Multics, both in the addressing architecture and in the file system.
3.1.5R[oger] M. Needham 和 A[ndrew] D. Birrell。CAP 归档系统。《第六届 ACM 操作系统原理研讨会论文集》,载《操作系统评论》第 11 卷第 5 期(1977 年 11 月),第 11-16 页。
3.1.5 R[oger] M. Needham and A[ndrew] D. Birrell. The CAP filing system. Proceedings of the Sixth ACM Symposium on Operating Systems Principles, in Operating Systems Review 11, 5 (November 1977), pages 11–16.
CAP 文件系统是真正命名网络的少数实现示例之一。
The CAP file system is one of the few implemented examples of a genuine naming network.
3.2.1Paul J. Leach、Bernard L. Stumpf、James A. Hamilton 和 Paul H. Levine。《UID 作为分布式文件系统中的内部名称》。《ACM SIGACT-SIGOPS 分布式计算原理研讨会》,安大略省渥太华(1982 年 8 月 18-20 日),第 34-41 页。
3.2.1 Paul J. Leach, Bernard L. Stumpf, James A. Hamilton, and Paul H. Levine. UIDs as internal names in a distributed file system. In ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Ottawa, Ontario (August 18-20, 1982), pages 34–41.
Apollo DOMAIN系统支持不同的分布式功能模型。它提供了一种称为单级存储的共享主内存,该内存可在整个网络中透明扩展。它也是少数几个充分利用紧凑集合中的非结构化唯一标识符作为对象名称的系统之一。本文重点介绍后者。
The Apollo DOMAIN system supports a different model for distributed function. It provides a shared primary memory called the Single Level Store, which extends transparently across the network. It is also one of the few systems to make substantial use of unstructured unique identifiers from a compact set as object names. This paper focuses on this latter issue.
3.2.2Rob Pike 等人。贝尔实验室的 Plan 9。计算系统 8,3(1995 年夏季),第 221-254 页。Rob Pike、Dave Presotto、Ken Thompson 和 Howard Trickey 编写的早期版本出现在1990 年夏季 UKUUG 会议论文集(1990 年),伦敦,第 1-9 页。本文介绍了一种分布式操作系统,它进一步发展了UNIX系统“每个资源都是一个文件”的理念,并将其用于网络和窗口系统交互。它还通过定义一个文件系统协议来访问所有资源(无论是本地资源还是远程资源),将文件理念扩展到分布式系统。进程可以将任何远程资源装入其名称空间,对于用户来说,这些远程资源的行为就像本地资源一样。这种设计使用户将系统视为一个易于使用的分时系统,其行为就像一台功能强大的计算机,而不是一组独立的计算机。
3.2.2 Rob Pike et al. Plan 9 from Bell Labs. Computing Systems 8, 3 (Summer 1995), pages 221–254. An earlier version by Rob Pike, Dave Presotto, Ken Thompson, and Howard Trickey appeared in Proceedings of the Summer 1990 UKUUG Conference (1990), London, pages 1–9. This paper describes a distributed operating system that takes the UNIX system idea that every resource is a file one step further by using it also for network and window system interactions. It also extends the file idea to a distributed system by defining a single file system protocol for access to all resources, whether they are local or remote. Processes can mount any remote resources into their name space, and to the user these remote resources behave just like local resources. This design makes users perceive the system as an easy-to-use time-sharing system that behaves like a single powerful computer, instead of a collection of separate computers.
3.2.3Tim Berners-Lee 等人。万维网。《ACM 通讯》37,8(1994 年 8 月),第 76-82 页。
3.2.3 Tim Berners-Lee et al. The World Wide Web. Communications of the ACM 37, 8 (August 1994), pages 76–82.
很多有关万维网的出版物只能在网络上找到,而万维网联盟的主页就是一个很好的起点,网址为< http://w3c.org/ >。
Many of the publications about the World Wide Web are available only on the Web, with a good starting point being the home page of the World Wide Web Consortium at <http://w3c.org/>.
3.2.4谢尔盖·布林和劳伦斯·佩奇。《大型超文本网络搜索引擎的剖析》。澳大利亚布里斯班第 7 届 WWW 会议论文集(1998 年 4 月)。另见《计算机网络》第 30 卷(1998 年),第 107-117 页。
3.2.4 Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th WWW Conference, Brisbane, Australia (April 1998). Also in Computer Networks 30 (1998), pages 107–117.
本文介绍了 Google 搜索引擎的早期版本。它还引入了页面排名的概念,以便按重要性对查询结果进行排序。搜索是用户“命名”网页的主要方式。
This paper describes an early version of Google’s search engine. It also introduces the idea of page rank to sort the results to a query in order of importance. Search is a dominant way in which users “name” Web pages.
3.2.5Bryan Ford 等。《全球联网移动设备的持久个人名称》。《第七届 USENIX 操作系统设计与实现研讨会论文集》(2006 年 11 月),第 233–248 页。
3.2.5 Bryan Ford et al. Persistent personal names for globally connected mobile devices. Proceedings of the Seventh USENIX Symposium on Operating Systems Design and Implementation (November 2006), pages 233–248.
本文介绍了一种个人设备命名系统。每台设备都是其自身命名网络的根,可以使用简短、方便的名称来命名属于同一用户或属于该用户社交网络中人员的其他设备。命名系统的实现允许设备与互联网断开连接并解析可访问设备的名称。前五页列出了基本的命名计划。后面的部分解释了安全属性和基于安全的实现,其中涉及第 11 章 [在线] 的材料。
This paper describes a naming system for personal devices. Each device is a root of its own naming network and can use short, convenient names for other devices belonging to the same user or belonging to people in the user’s social network. The implementation of the naming system allows devices to be disconnected from the Internet and resolve names of devices that are reachable. The first five pages lay out the basic naming plan. Later sections explain security properties and a security-based implementation, which involves material of Chapter 11 [on-line].
许多系统采用客户端/服务模式组织。网络文件系统 (参见第 4.5 节) 就是一个提供良好案例研究的系统。以下论文提供了一些其他示例。
Many systems are organized in a client/service style. A system that provides a good case study is the Network File System (see Section 4.5). The following papers provide some other examples.
4.1.1Andrew D. Birrell 和 Bruce Jay Nelson。实现远程过程调用。ACM Transactions on Computer Systems 2 , 1(1984 年 2 月),第 39-59 页。
4.1.1 Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (February 1984), pages 39–59.
这篇写得很好的论文首先展示了基本思想的简单性,其次展示了处理实际实现所需的复杂性,第三展示了达到高效能所需的改进。
A well-written paper that shows first, the simplicity of the basic idea, second, the complexity required to deal with real implementations, and third, the refinements needed for high effectiveness.
4.1.2Andrew Birrell、Greg Nelson、Susan Owicki 和 Edward Wobber。网络对象。《第十四届 ACM 操作系统原理研讨会论文集》,载于《操作系统评论》第 27 卷第 5 期(1993 年 12 月),第 217-230 页。
4.1.2 Andrew Birrell, Greg Nelson, Susan Owicki, and Edward Wobber. Network objects. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 27, 5 (December 1993), pages 217–230.
本文介绍了一种基于远程过程调用的分布式应用程序编程语言,它向程序员隐藏了大部分的“分布式”。
This paper describes a programming language for distributed applications based on remote procedure calls, which hide most “distributedness” from the programmer.
4.1.3Ann Wollrath、Roger Riggs 和 Jim Waldo。Java™ 系统的分布式对象模型。计算系统 9,4(1996 年),第 265-290 页。最初发表于第二届 USENIX 面向对象技术会议论文集第 2 卷(1996 年)。
4.1.3 Ann Wollrath, Roger Riggs, and Jim Waldo. A distributed object model for the Java™ system. Computing Systems 9, 4 (1996), pages 265–290. Originally published in Proceedings of the Second USENIX Conference on Object-Oriented Technologies Volume 2 (1996).
本文介绍了一种用于 Java 编程语言的远程过程调用系统。它清晰地描述了 RPC 系统如何与面向对象编程语言集成以及 RPC 引入的新异常类型。
This paper presents a remote procedure call system for the Java programming language. It provides a clear description of how an RPC system can be integrated with an object-oriented programming language and the new exception types RPC introduces.
4.2.1Daniel Swinehart、Gene McDaniel 和 David [R.] Boggs。WFS:一种适用于分布式环境的简单共享文件系统。《第七届 ACM 操作系统原理研讨会论文集》,载于《操作系统评论》第 13 卷,第 5 期(1979 年 12 月),第 9-17 页。
4.2.1 Daniel Swinehart, Gene McDaniel, and David [R.] Boggs. WFS: A simple shared file system for a distributed environment. Proceedings of the Seventh ACM Symposium on Operating Systems Principles, in Operating Systems Review 13, 5 (December 1979), pages 9–17.
这个远程文件系统的早期版本为跨连接的协作计算机分配功能这一主题打开了大门。作者的具体目标是保持简单;因此,机制和目标之间的关系比更现代但更复杂的系统要清晰得多。
This early version of a remote file system opens the door to the topic of distribution of function across connected cooperating computers. The authors’ specific goal was to keep things simple; thus, the relationship between mechanism and goal is much clearer than in more modern, but more elaborate, systems.
4.2.2Robert Scheifler 和 James Gettys。X Window 系统。ACM Transactions on Graphics 5,2(1986 年 4 月),第 79-109 页。
4.2.2 Robert Scheifler and James Gettys. The X Window System. ACM Transactions on Graphics 5, 2 (April 1986), pages 79–109.
X Window 系统是世界上几乎所有工程工作站的首选窗口系统。它提供了使用客户端/服务模型实现模块化的一个很好的例子。X Window 系统的主要贡献之一是它弥补了显示器取代打字机时UNIX系统出现的缺陷:显示器和键盘是UNIX应用程序编程接口中唯一依赖于硬件的部分。X Window 系统允许面向显示的UNIX应用程序完全独立于底层硬件。此外,X Window 系统在应用程序和显示器之间插入了有效的网络连接,从而允许在分布式系统中实现配置灵活性。
The X Window System is the window system of choice on practically every engineering workstation in the world. It provides a good example of using the client/service model to achieve modularity. One of the main contributions of the X Window System is that it remedied a defect that had crept into the UNIX system when displays replaced typewriters: the display and keyboard were the only hardware-dependent parts of the UNIX application programming interface. The X Window System allowed display-oriented UNIX applications to be completely independent of the underlying hardware. In addition, the X Window System interposes an efficient network connection between the application and the display, allowing configuration flexibility in a distributed system.
4.2.3John H. Howard 等人。分布式文件系统的规模和性能。ACM计算机系统学报 6,1(1988 年 2 月),第 51-81 页。
4.2.3 John H. Howard et al. Scale and performance in a distributed file system. ACM Transactions on Computer Systems 6, 1 (February 1988), pages 51–81.
本文介绍了校园网络的 Andrew 网络文件系统原型的使用经验,并展示了该经验如何促使设计发生变化。Andrew 文件系统对 NFS 版本 4 产生了很大的影响。
This paper describes experience with a prototype of the Andrew network file system for a campus network and shows how the experience motivated changes in the design. The Andrew file system had strong influence on version 4 of NFS.
域名系统是目前运行的最有趣的分布式系统之一。它不仅是许多分布式应用程序的构建块,而且本身就是一个有趣的案例研究,为任何想要构建分布式系统或命名系统的人提供了很多见解。
The domain name system is one of the most interesting distributed systems in operation. It is not only a building block in many distributed applications, but is itself an interesting case study, offering many insights for anyone wanting to build a distributed system or a naming system.
4.3.1Paul V. Mockapetris 和 Kevin J. Dunlap。域名系统的开发。SIGCOMM 1988 研讨会论文集,第 123-133 页。还发表于ACM 计算机通信评论 18,4(1988 年 8 月),第 123-133 页,并重新发表于ACM 计算机通信评论 25,1(1995 年 1 月),第 112-122 页。
4.3.1 Paul V. Mockapetris and Kevin J. Dunlap. Development of the Domain Name System. Proceedings of the SIGCOMM 1988 Symposium, pages 123–133. Also published in ACM Computer Communications Review 18, 4 (August 1988), pages 123–133, and republished in ACM Computer Communications Review 25, 1 (January 1995), pages 112–122.
4.3.2Paul [V.] Mockapetris。域名——概念和设施。征求意见 RFC 1034,互联网工程任务组(1987 年 11 月)。
4.3.2 Paul [V.] Mockapetris. Domain names—Concepts and facilities. Request for Comments RFC 1034, Internet Engineering Task Force (November 1987).
4.3.3Paul [V.] Mockapetris。域名——实施和规范。征求意见 RFC 1035,互联网工程任务组(1987 年 11 月)。
4.3.3 Paul [V.] Mockapetris. Domain names—Implementation and specification. Request for Comments RFC 1035, Internet Engineering Task Force (November 1987).
这三个文档解释了DNS协议。
These three documents explain the DNS protocol.
4.3.4Paul Vixie。DNS 复杂性。ACM Queue 5,3(2007 年 4 月),第 24-29 页。
4.3.4 Paul Vixie. DNS Complexity. ACM Queue 5, 3 (April 2007), pages 24–29.
本文揭示了第 4.4 节案例研究中描述的 DNS 在实践中的许多复杂性。DNS 协议很简单,并且没有完整、精确的系统规范。作者认为,当前的 DNS 描述规范是一个优势,因为它允许各种实现不断发展以根据需要包含新功能。本文描述了其中的许多功能,并表明 DNS 是当今使用的最有趣的分布式系统之一。
This paper uncovers many of the complexities of how DNS, described in the case study in Section 4.4, works in practice. The protocol for DNS is simple, and no complete, precise specification of the system exists. The author argues that the current descriptive specification of DNS is an advantage because it allows various implementations to evolve to include new features as needed. The paper describes many of these features and shows that DNS is one of the most interesting distributed systems in use today.
UNIX系统上的读物(参见读物第 2.2 节)是研究内核的一个很好的起点。
The readings on the UNIX system (see readings section 2.2) are a good starting point for studying kernels.
5.1.1Per Brinch Hansen。多道程序设计系统的核心。ACM通讯 13,4(1970 年 4 月),第 238-241 页。RC-4000 是第一个使用消息作为主要并发协调机制的系统,并且可能至今仍是解释得最清楚的系统。它也是今天所谓的微内核设计。
5.1.1 Per Brinch Hansen. The nucleus of a multiprogramming system. Communications of the ACM 13, 4 (April 1970), pages 238–241. The RC–4000 was the first, and may still be the best explained, system to use messages as the primary concurrency coordination mechanism. It is also what would today be called a microkernel design.
5.1.2M. Frans Kaashoek 等人。外内核系统上的应用程序性能和灵活性。载于《第十六届 ACM 操作系统原理研讨会论文集》,载于《操作系统评论》第 31 卷,第 5 期(1997 年 12 月),第 52-65 页。
5.1.2 M. Frans Kaashoek et al. Application performance and flexibility on exokernel systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 31, 5 (December 1997), pages 52–65.
外内核提供了一种将策略与机制分离的极端版本,它牺牲抽象来将物理环境的所有可能的方面(在保护约束内)暴露给下一个更高层,从而为更高层在为其首选编程环境创建抽象或为其首选应用程序量身定制抽象方面提供了最大的灵活性。
The exokernel provides an extreme version of the separation of policy from mechanism, sacrificing abstraction to expose (within protection constraints) all possible aspects of the physical environment to the next higher layer. That higher layer thus has maximum flexibility to create abstractions suited to its preferred programming environment or tailored to its preferred application.
5.2.1Butler W. Lampson 和 Howard E. Sturgis。《操作系统设计思考》。《ACM 通讯》19,5(1976 年 5 月),第 251-265 页。
5.2.1 Butler W. Lampson and Howard E. Sturgis. Reflections on an operating system design. Communications of the ACM 19, 5 (May 1976), pages 251–265.
加州大学伯克利分校设计的 CAL 操作系统似乎是第一个在操作系统界面中明确使用类型的系统。除了介绍这一想法之外,Lampson 和 Sturgis 还深入分析了各种设计决策的利弊。该系统记录较晚,实际上于 1969 年实施。
An operating system named CAL, designed at the University of California at Berkeley, appears to be the first system to make explicit use of types in the interface to the operating system. In addition to introducing this idea, Lampson and Sturgis also give good insight into the pros and cons of various design decisions. Documented late, the system was actually implemented in 1969.
5.2.2Michael D. Schroeder、David D. Clark 和 Jerome H. Saltzer。Multics 内核设计项目。《第六届 ACM 操作系统原理研讨会论文集》,载于《操作系统评论》第 11 卷第 5 期(1977 年 11 月),第 43-56 页。
5.2.2 Michael D. Schroeder, David D. Clark, and Jerome H. Saltzer. The Multics kernel design project. Proceedings of the Sixth ACM Symposium on Operating Systems Principles, in Operating Systems Review 11, 5 (November 1977), pages 43–56.
本文讨论了在将类型扩展(以及微内核思维,尽管当时不这么称呼)应用于 Multics 时遇到的一系列问题,以简化其内部组织并减少其可信基础的大小。Philippe Janson 的博士论文《使用类型扩展来组织虚拟内存机制》(麻省理工学院电气工程和计算机科学系,1976 年 8 月)更深入地探讨了其中的许多想法。该论文也可作为麻省理工学院计算机科学实验室技术报告 TR-167(1976 年 9 月)使用。
This paper addresses a wide range of issues encountered in applying type extension (as well as microkernel thinking, though it wasn’t called that at the time) to Multics in order to simplify its internal organization and reduce the size of its trusted base. Many of these ideas were explored in even more depth in Philippe Janson’s Ph.D. Thesis, Using Type Extension to Organize Virtual Memory Mechanisms, M.I.T. Department of Electrical Engineering and Computer Science, August 1976. That thesis is also available as M.I.T. Laboratory for Computer Science Technical Report TR–167, September 1976.
5.2.3Galen C. Hunt 和 James R. Larus。《奇点:重新思考软件堆栈》。《操作系统评论》41,2(2007 年 4 月),第 37-49 页。
5.2.3 Galen C. Hunt and James R. Larus. Singularity: Rethinking the software stack. Operating Systems Review 41, 2 (April 2007), pages 37–49.
Singularity 是一种操作系统,它使用类型安全语言来强制不同软件模块之间的模块化,而不是依赖虚拟内存硬件。内核和所有应用程序均采用具有自动垃圾收集功能的强类型编程语言编写。它们在单个地址空间中运行,并由语言运行时相互隔离。它们只能通过携带类型检查消息的通信通道相互交互。
Singularity is an operating system that uses type-safe languages to enforce modularity between different software modules, instead of relying on virtual-memory hardware. The kernel and all applications are written in a strongly typed programming language with automatic garbage collection. They run in a single address space and are isolated from each other by the language runtime. They can interact with each other only through communication channels that carry type-checked messages.
5.3.1Andrew D. Birrell。线程编程简介。数字设备公司系统研究中心技术报告第 35 号,1989 年 1 月。33 页。(也出现在Greg Nelson 编辑的《使用 Modula-3 进行系统编程》第 4 章中,Prentice-Hall,1991 年,第 88-118 页。)C# 编程语言版本出现在 Microsoft Research 报告 MSR-TR-2005-68 中。
5.3.1 Andrew D. Birrell. An introduction to programming with threads. Digital Equipment Corporation Systems Research Center Technical Report #35, January 1989. 33 pages. (Also appears as Chapter 4 of Greg Nelson, editor, Systems Programming with Modula-3, Prentice-Hall, 1991, pages 88–118.) A version for the C# programming language appeared as Microsoft Research Report MSR-TR-2005–68.
这是一个非常棒的教程,它清楚地解释了基本问题,并展示了正确有效地利用线程所涉及的细微差别。
This is an excellent tutorial, explaining the fundamental issues clearly and going on to show the subtleties involved in exploiting threads correctly and effectively.
5.3.2Thomas E. Anderson 等人。调度程序激活:对用户级并行管理的有效内核支持。ACM计算机系统学报 10,1(1992 年 2 月),第 53-79 页。最初发表于第十三届 ACM 操作系统原理研讨会论文集,载于操作系统评论 25,5(1991 年 12 月),第 95-109 页。
5.3.2 Thomas E. Anderson et al. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems 10, 1 (February 1992), pages 53–79. Originally published in Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December 1991), pages 95–109.
本文突出了用户线程和内核线程之间的区别,并提供了一种通过拥有正确的用户/内核线程接口来获得两者优势的方法。本文还重新讨论了虚拟处理器的概念,但采用的是多处理器环境。
The distinction between user threads and kernel threads comes to the fore in this paper, which offers a way of getting the advantages of both by having the right kind of user/kernel thread interface. The paper also revisits the idea of a virtual processor, but in a multiprocessor context.
5.3.3David D. Clark。使用上行调用构建系统。《第十届 ACM 操作系统原理研讨会论文集》,载《操作系统评论》第 19 卷第 5 期(1985 年 12 月),第 171-180 页。
5.3.3 David D. Clark. The structuring of systems using upcalls. Proceedings of the Tenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 19, 5 (December 1985), pages 171–180.
尝试通过严格的分层来强制实施模块化结构有时会忽略最合适的结构的本质。本文描述了一种截然不同的模块间组织,在处理网络实现时似乎特别有效。
Attempts to impose modular structure by strict layering sometimes manage to overlook the essence of what structure is most appropriate. This paper describes a rather different intermodule organization that seems to be especially effective when dealing with network implementations.
5.3.4Jerome H. Saltzer。多路复用计算机系统中的流量控制。博士论文,麻省理工学院,电气工程系,1966 年 6 月。也可作为 1966 年 MAC 项目技术报告 TR-30 获得。
5.3.4 Jerome H. Saltzer. Traffic Control in a Multiplexed Computer System. Ph.D. Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering, June 1966. Also available as Project MAC Technical Report TR–30, 1966.
这项工作描述了可能是第一个系统化的虚拟处理器设计和线程包,即 Multics 系统中使用的多处理器复用方案。它定义了协调原语BLOCK和WAKEUP,它们是每个线程分配一个二进制信号量的示例。
This work describes what is probably the first systematic virtual processor design and thread package, the multiprocessor multiplexing scheme used in the Multics system. It defines the coordination primitives BLOCK and WAKEUP, which are examples of binary semaphores assigned one per thread.
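The following sketch suggests what a binary semaphore assigned one per thread looks like; the names and the C11 encoding are ours, not the thesis's. WAKEUP sets the target thread's private flag, and BLOCK consumes the caller's own flag, returning immediately if a wakeup arrived first, so a wakeup is never lost. This spinning user-space version stands in for what a kernel would implement by rescheduling the processor:

    /* Sketch (not the thesis's code) of per-thread binary semaphores.
     * BLOCK waits on the calling thread's own wakeup flag; WAKEUP(t)
     * sets thread t's flag.  A WAKEUP delivered before the matching
     * BLOCK is remembered, so it is never lost. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_THREADS 64
    static atomic_bool wakeup_waiting[MAX_THREADS];  /* one per thread */

    void BLOCK(int self) {
        bool expected = true;
        /* consume a pending wakeup, or spin until one arrives */
        while (!atomic_compare_exchange_weak(&wakeup_waiting[self],
                                             &expected, false))
            expected = true;     /* CAS clobbers 'expected' on failure */
    }

    void WAKEUP(int target) {
        atomic_store(&wakeup_waiting[target], true);
    }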
5.3.5Rob Pike 等。共享内存多处理器上的处理器睡眠和唤醒。EurOpen会议论文集(1991),第 161-166 页。
5.3.5 Rob Pike et al. Processor sleep and wakeup on a shared-memory multiprocessor. Proceedings of the EurOpen Conference (1991), pages 161–166.
这篇写得很好的论文出色地解释了在共享内存多处理器上正确实现抢占式多路复用、中断处理和协调原语有多么困难。
This well-written paper does an excellent job of explaining how difficult it is to get preemptive multiplexing, interrupt handling, and coordination primitives correct on a shared-memory multiprocessor.
很少有论文描述了一种简单、干净的设计。较旧的论文(其中一些可以在阅读第 3.1 节中找到)陷入了技术限制;较新的论文(其中一些可以在阅读第 6.1 节关于多级内存管理中找到)经常陷入性能优化的困境。关于使用 Intel x86 强制模块化的演变的案例研究(参见第5 章第5.7 节)描述了最广泛使用的处理器中的虚拟内存支持,并展示了它是如何随着时间的推移而演变的。
There are few examples of papers that describe a simple, clean design. The older papers (some can be found in reading section 3.1) get bogged down in technology constraints; the more recent papers (some of them can be found in reading section 6.1 on multilevel memory management) often get bogged down in performance optimizations. The case study on the evolution of enforcing modularity with the Intel x86 (see Section 5.7 of Chapter 5) describes virtual memory support in the most widely used processor and shows how it evolved over time.
5.4.1A[ndre] Bensoussan、C[harles] T. Clingen 和 R[obert] C. Daley。Multics 虚拟内存:概念和设计。Communications of the ACM 15 , 5(1972 年 5 月),第 308–318 页。
5.4.1 A[ndre] Bensoussan, C[harles] T. Clingen, and R[obert] C. Daley. The Multics virtual memory: Concepts and design. Communications of the ACM 15, 5 (May 1972), pages 308–318.
这是对一个系统的一个很好的描述,该系统率先使用高性能寻址架构来支持复杂的虚拟内存系统,包括内存映射文件。该设计受到现有硬件技术(具有 18 位地址空间的 0.3 MIPS 处理器)的限制和影响,但这篇论文是一篇经典且易于阅读的文章。
This is a good description of a system that pioneered the use of high-powered addressing architectures to support a sophisticated virtual memory system, including memory-mapped files. The design was constrained and shaped by the available hardware technology (0.3 MIPS processor with an 18-bit address space), but the paper is a classic and easy to read.
每本现代教科书都涵盖了协调这一主题,但通常都忽略了细节,而且通常过分强调各种机制。这些读物要么更仔细地解释问题,要么从各个方向扩展基本概念。
Every modern textbook covers the topic of coordination but typically brushes past the subtleties and also typically gives the various mechanisms more emphasis than they deserve. These readings either explain the issues much more carefully or extend the basic concepts in various directions.
5.5.1E[dsger] W. Dijkstra。协同顺序进程。收录于 F. Genuys 主编的《编程语言》,北约高级研究所,维拉尔德朗,1966 年。Academic Press,1968 年,第 43-112 页。本文介绍了信号量,这是学术练习中最常用的同步原语,它以非常谨慎、循序渐进的方式开发了互斥的要求及其实现。许多现代处理方法都忽略了这里讨论的微妙之处,好像它们是显而易见的。其实不然,如果你想了解同步,你应该阅读本文。
5.5.1 E[dsger] W. Dijkstra. Co-operating sequential processes. In F. Genuys, editor, Programming Languages, NATO Advanced Study Institute, Villard-de-Lans, 1966. Academic Press, 1968, pages 43–112. This paper introduces semaphores, the synchronizing primitive most often used in academic exercises, and is notable for its very careful, step-by-step development of the requirements for mutual exclusion and its implementation. Many modern treatments ignore the subtleties discussed here as if they were obvious. They aren’t, and if you want to understand synchronization you should read this paper.
5.5.2E[dsger] W. Dijkstra。并发编程控制中的问题解决方案。ACM通讯 8、9(1965 年 9 月),第 569 页。
5.5.2 E[dsger] W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM 8, 9 (September 1965), page 569.
在这篇非常简短的论文中,Dijkstra 首先报告了 Dekker 的观察结果:多处理器锁可以完全用软件实现,依靠硬件仅保证读写操作具有之前或之后的原子性。
In this very brief paper, Dijkstra first reports Dekker’s observation that multiprocessor locks can be implemented entirely in software, relying on the hardware to guarantee only that read and write operations have before-or-after atomicity.
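To see what such a software-only lock looks like, here is a sketch of Peterson's two-thread algorithm, a later and simpler relative of Dekker's solution. The C11 encoding is ours; the sequentially consistent atomics stand in for the memory behavior the original argument assumes, since ordinary reads and writes may be reordered on modern processors:

    /* Sketch of Peterson's two-thread lock, a later simplification of
     * the Dekker-style idea: mutual exclusion built only from atomic
     * loads and stores.  C11 sequentially consistent atomics stand in
     * for the memory behavior the original argument assumes. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool interested[2];   /* thread i wants the lock   */
    static atomic_int  turn;            /* who must defer on a tie   */

    void lock(int me) {                 /* me is 0 or 1 */
        int other = 1 - me;
        atomic_store(&interested[me], true);
        atomic_store(&turn, other);     /* politely let the other go */
        while (atomic_load(&interested[other]) &&
               atomic_load(&turn) == other)
            ;                           /* spin until it is our turn */
    }

    void unlock(int me) {
        atomic_store(&interested[me], false);
    }

Note how a thread deliberately yields the tie to its peer before spinning; that small courtesy is what rules out both deadlock and starvation for two threads.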
5.5.3Leslie Lamport。一种快速互斥算法。ACM Transactions on Computer Systems 5,1(1987 年 2 月),第 1-11 页。
5.5.3 Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems 5, 1 (February 1987), pages 1–11.
本文介绍了一种纯软件实现的锁的快速版本,并论证了为什么该版本是最佳的。
This paper presents a fast version of a software-only implementation of locks and gives an argument as to why this version is optimal.
5.5.4David P. Reed 和 Rajendra K. Kanodia。通过事件计数和序列器进行同步。《ACM 通讯》22,2(1979 年 2 月),第 115-123 页。
5.5.4 David P. Reed and Rajendra K. Kanodia. Synchronization with eventcounts and sequencers. Communications of the ACM 22, 2 (February 1979), pages 115–123.
本文介绍了一种极其简单的协调系统,它使用的功能不如互斥原语强大的排序原语,因此结果是简单的正确性论证。
This paper introduces an extremely simple coordination system that uses less powerful primitives for sequencing than for mutual exclusion; a consequence is simple correctness arguments.
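As a taste of the primitives (a sketch in the spirit of the paper, not its code): an eventcount is a counter that only increases, await(ec, v) waits until it has reached v, and advance(ec) increments it. Two eventcounts coordinate a single-producer, single-consumer bounded buffer with no mutual exclusion at all; with multiple producers, a sequencer's ticket operation would be used to order them:

    /* Sketch in the spirit of eventcounts (names and details ours).
     * An eventcount only increases; await(ec, v) waits until it has
     * reached v.  Two eventcounts coordinate a single-producer,
     * single-consumer bounded buffer with no locks at all. */
    #include <stdatomic.h>

    #define N 16                     /* buffer slots */
    static int buffer[N];
    static atomic_long in;           /* items produced so far  */
    static atomic_long out;          /* items consumed so far  */

    static void await(atomic_long *ec, long value) {
        while (atomic_load(ec) < value)
            ;                        /* spin; a real system would block */
    }

    void produce(int item) {
        long i = atomic_load(&in);
        await(&out, i - N + 1);      /* wait until a slot is free */
        buffer[i % N] = item;
        atomic_store(&in, i + 1);    /* advance(in): item is ready */
    }

    int consume(void) {
        long o = atomic_load(&out);
        await(&in, o + 1);           /* wait until an item exists */
        int item = buffer[o % N];
        atomic_store(&out, o + 1);   /* advance(out): slot is free */
        return item;
    }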
5.5.5Butler W. Lampson 和 David D. Redell。《Mesa 中的进程和监视器经验》。《ACM 通讯》23,2(1980 年 2 月),第 105-117 页。
5.5.5 Butler W. Lampson and David D. Redell. Experience with processes and monitors in Mesa. Communications of the ACM 23, 2 (February 1980), pages 105–117.
这是关于将并发活动协调集成到编程语言中所涉及的缺陷的一次很好的讨论。
This is a nice discussion of the pitfalls involved in integrating concurrent activity coordination into a programming language.
5.5.6Stefan Savage 等人。Eraser:用于多线程程序的动态数据争用检测器。ACM Transactions on Computer Systems 15,4(1997 年 11 月),第 391-411 页。另见第十六届 ACM 操作系统原理研讨会论文集(1997 年 10 月)。
5.5.6 Stefan Savage et al. Eraser: A dynamic data race detector for multi-threaded programs. ACM Transactions on Computer Systems 15, 4 (November 1997), pages 391–411. Also in the Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (October 1997).
本文介绍了一种用于定位某些类别的锁定错误的有趣策略:通过修补程序的二进制数据引用来检测程序;然后观察这些数据引用以查看程序是否违反了锁定协议。
This paper describes an interesting strategy for locating certain classes of locking mistakes: instrument the program by patching its binary data references; then watch those data references to see if the program violates the locking protocol.
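The refinement rule at the heart of the checker, the lockset algorithm, is compact enough to sketch. In this illustration (data layout ours, with locks represented as bits in a mask), each monitored variable starts with every lock as a candidate, and each access intersects the candidates with the locks the accessing thread holds:

    /* Sketch of the lockset refinement rule (data layout ours; locks
     * are represented as bits in a mask).  An empty candidate set
     * means no single lock consistently protects the variable, so a
     * race is possible. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t lockset;        /* one bit per lock */

    struct shadow {                  /* shadow state for one variable */
        lockset candidates;
        int warned;
    };

    struct shadow make_shadow(void) {
        struct shadow s = { ~(lockset)0, 0 };  /* all locks candidates */
        return s;
    }

    void on_access(struct shadow *s, lockset held, const char *name) {
        s->candidates &= held;       /* C(v) = C(v) intersect held(t) */
        if (s->candidates == 0 && !s->warned) {
            printf("possible data race on %s\n", name);
            s->warned = 1;
        }
    }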
5.5.7Paul E. McKenney 等人。阅读副本更新。渥太华 Linux 研讨会论文集,2002 年,第 338-367 页。
5.5.7 Paul E. McKenney et al. Read-copy update. Proceedings of the Ottawa Linux Symposium, 2002, pages 338–367.
本文观察到,对于读取次数最多且修改次数不多的数据结构,锁定可能是一种昂贵的前后原子性机制。作者提出了一种新技术,即读取-复制更新 (RCU),它提高了性能和可扩展性。Linux 内核对处理器最常读取的许多数据结构都使用了这种机制。
This paper observes that locks can be an expensive mechanism for before-or-after atomicity for data structures that are mostly read and infrequently modified. The authors propose a new technique, read-copy update (RCU), which improves performance and scalability. The Linux kernel uses this mechanism for many of its data structures that processors mostly read.
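A conceptual sketch of the technique follows (this is not the Linux kernel's RCU API; all names are ours). Readers follow the shared pointer with one atomic load and never block; a writer copies the structure, edits the copy, and publishes it with a single pointer swap. The hard part of real RCU, deciding when the old copy can be freed (the "grace period"), is deliberately omitted:

    /* Conceptual sketch of read-copy update.  Readers take no lock;
     * a single writer copies, edits, and publishes a fresh version. */
    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    struct config { int a, b; };
    static _Atomic(struct config *) current;

    void init_config(void) {
        atomic_store(&current, calloc(1, sizeof(struct config)));
    }

    int reader(void) {
        struct config *c = atomic_load(&current);   /* no lock taken */
        return c->a + c->b;
    }

    void update(int a, int b) {       /* assumes a single writer */
        struct config *fresh = malloc(sizeof *fresh);
        struct config *old = atomic_load(&current);
        memcpy(fresh, old, sizeof *fresh);          /* copy    */
        fresh->a = a;                               /* update  */
        fresh->b = b;
        atomic_store(&current, fresh);              /* publish */
        /* old may be freed only after a "grace period" guarantees no
         * reader still holds it; that bookkeeping, the heart of real
         * RCU, is omitted from this sketch. */
        (void)old;
    }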
5.5.8Maurice Herlihy。无等待同步。ACM编程语言和系统事务11,1(1991 年 1 月),第 124-149 页。
5.5.8 Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems 11, 1 (January 1991), pages 124–149.
本文介绍了无等待同步(现在通常称为非阻塞协调)的目标,并给出了集合、列表和队列等常见数据结构的非阻塞、并发实现。
This paper introduces the goal of wait-free synchronization, now often called non-blocking coordination, and gives non-blocking, concurrent implementations of common data structures such as sets, lists, and queues.
5.5.9Timothy L. Harris。非阻塞链接列表的实用实现。第十五届国际分布式计算研讨会论文集(2001 年 10 月),第 300-314 页。
5.5.9 Timothy L. Harris. A pragmatic implementation of non-blocking linked lists. Proceedings of the Fifteenth International Symposium on Distributed Computing (October 2001), pages 300–314.
本文介绍了一种链表的实际实现,其中线程可以并发插入而不会阻塞。
This paper describes a practical implementation of a linked list in which threads can insert concurrently without blocking.
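The insertion half of such a list is a short exercise in compare-and-swap, sketched below (C11 encoding ours; head is a sentinel node, and concurrent deletion is not handled). Deletion is where the pragmatics lie: Harris first marks the victim's next pointer so that concurrent inserts cannot attach to a node that is being unlinked:

    /* Sketch of lock-free insertion into a sorted singly linked list.
     * 'head' is a sentinel node that is never removed, and no
     * concurrent deletion is attempted; deletion is the hard case
     * that the paper solves by marking pointers. */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {
        int key;
        _Atomic(struct node *) next;
    };

    void insert(struct node *head, int key) {
        struct node *n = malloc(sizeof *n);
        n->key = key;
        for (;;) {
            /* find the insertion point: prev->key < key <= curr->key */
            struct node *prev = head;
            struct node *curr = atomic_load(&prev->next);
            while (curr != NULL && curr->key < key) {
                prev = curr;
                curr = atomic_load(&curr->next);
            }
            atomic_store(&n->next, curr);
            /* link n in atomically; if prev->next changed under us,
             * another thread won the race, so retry from the start */
            if (atomic_compare_exchange_weak(&prev->next, &curr, n))
                return;
        }
    }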
另请参阅Brinch Hansen 撰写的阅读材料5.1.1,其中使用消息作为协调技术,以及Birrell 撰写的阅读材料5.3.1,其中描述了使用线程编程的一套完整的协调原语。
See also reading 5.1.1, by Brinch Hansen, which uses messages as a coordination technique, and reading 5.3.1 by Birrell, which describes a complete set of coordination primitives for programming with threads.
5.6.1Robert J. Creasy。VM/370 分时系统的起源。IBM研究与开发杂志 25,5(1981 年),第 483-490 页。
5.6.1 Robert J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development 25, 5 (1981), pages 483–490.
本文对 20 世纪 60 年代中期 IBM 360 计算机架构虚拟化项目以及 VM/370 的开发进行了深刻回顾,VM/370 在 20 世纪 70 年代成为流行的虚拟机系统。当时,VM/370 的独特之处在于它创建了一个严格、按部就班的硬件虚拟机,从而能够在受控环境中运行任何 system/370 程序。由于这是一个先驱项目,作者对此进行了特别深入的解释,从而很好地介绍了虚拟机实现中的概念和问题。
This paper is an insightful retrospective about a mid-1960s project to virtualize the IBM 360 computer architecture and the development that led to VM/370, which in the 1970s became a popular virtual machine system. At the time, the unusual feature of VM/370 was its creation of a strict, by-the-book, hardware virtual machine, thus providing the ability to run any system/370 program in a controlled environment. Because it was a pioneer project, the author explained things particularly well, thus providing a good introduction to the concepts and problems in implementing virtual machines.
5.6.2Edouard Bugnion 等人。Disco:在可扩展多处理器上运行商用操作系统。ACM Transactions on Computer Systems 15,4(1997 年 11 月),第 412-447 页。
5.6.2 Edouard Bugnion et al. Disco: running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems 15, 4 (November 1997), pages 412–447.
本文使虚拟机重新成为构建系统的主流方式。
This paper brought virtual machines back as a mainstream way of building systems.
5.6.3Carl Waldspurger。VMware ESX 服务器中的内存资源管理。第五届 USENIX 操作系统设计和实现研讨会论文集(2002 年 12 月),第 181-194 页。
5.6.3 Carl Waldspurger. Memory resource management in VMware ESX server. Proceedings of the Fifth USENIX Symposium on Operating Systems Design and Implementation (December 2002), pages 181–194.
这篇精心撰写的论文介绍了一个很好的技巧(气球驱动程序)来决定为客户操作系统提供多少物理内存。
This well-written paper introduces a nice trick (a balloon driver) to decide how much physical memory to give to guest operating systems.
5.6.4Keith Adams 和 Ole Agesen。x86 虚拟化的软件和硬件技术比较。第十二届编程语言和操作系统架构支持研讨会论文集(2006 年 10 月)。ISBN:1–59593–451–0。另见《操作系统评论》40,5(2006 年 12 月),第 2–13 页。
5.6.4 Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86 virtualization. Proceedings of the Twelfth Symposium on Architectural Support for Programming Languages and Operating Systems (October 2006). ISBN: 1–59593–451–0. Also in Operating Systems Review 40, 5 (December 2006), pages 2–13.
本文介绍了如何虚拟化 Intel x86 指令集以构建高性能虚拟机。它比较了两种实施策略:一种使用软件技术(例如二进制重写)来虚拟化指令集,另一种使用 x86 处理器的最新硬件添加来简化虚拟化。通过比较,您可以了解在现代 x86 处理器中实施现代虚拟机和操作系统支持的见解。
This paper describes how one can virtualize the Intel x86 instruction set to build a high-performance virtual machine. It compares two implementation strategies: one that uses software techniques, such as binary rewriting, to virtualize the instruction set, and one that uses recent hardware additions to the x86 processor to make virtualizing easier. The comparison provides insights about implementing modern virtual machines and operating system support in modern x86 processors.
另请参阅有关 VAX 机器的安全虚拟机监视器的论文,阅读内容为11.3.5。
Also see the paper on the secure virtual machine monitor for the VAX machine, reading 11.3.5.
在Patterson 和 Hennessy 所著的书的第 5 章(阅读1.1.1)中可以找到对内存层次结构的出色讨论,其中特别关注了缓存的设计空间。在Tanenbaum 的计算机系统书的第 3 章(阅读1.2.1)中可以找到更轻松的处理,更侧重于虚拟内存,并包括对堆栈算法的讨论。
An excellent discussion of memory hierarchies, with special attention paid to the design space for caches, can be found in Chapter 5 of the book by Patterson and Hennessy, reading 1.1.1. A lighter-weight treatment focused more on virtual memory, and including a discussion of stack algorithms, can be found in Chapter 3 of Tanenbaum’s computer systems book, reading 1.2.1.
6.1.1R[obert] A. Frieburghouse。通过计数实现寄存器分配。《ACM 通讯》17,11(1974 年 11 月),第 638-642 页。
6.1.1 R[obert] A. Frieburghouse. Register allocation via usage counts. Communications of the ACM 17, 11 (November 1974), pages 638–642.
本文表明编译器代码生成器必须进行多级内存管理,并且它们存在与缓存和分页系统相同的问题。
This paper shows that compiler code generators must do multilevel memory management and that they have the same problems as do caches and paging systems.
6.1.2R[ichard] L. Mattson、J. Gecsei、D[onald] R. Slutz 和 I[rving] L. Traiger。存储层次结构的评估技术。IBM Systems Journal 9,2(1970 年),第 78-117 页。
6.1.2 R[ichard] L. Mattson, J. Gecsei, D[onald] R. Slutz, and I[rving] L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal 9, 2 (1970), pages 78–117.
作为关于堆栈算法及其分析的原始参考,该论文写得很好,并且比现代教科书中的简短摘要提出了更为深入的观察。
The original reference on stack algorithms and their analysis, this paper is well written and presents considerably more in-depth observations than the brief summaries that appear in modern textbooks.
6.1.3Richard Rashid 等人。分页式单处理器和多处理器架构的独立于机器的虚拟内存管理。IEEE计算机学报 37,8(1988 年 8 月),第 896-908 页。最初发表于第二届编程语言和操作系统架构支持国际会议论文集(1987 年 11 月),第 31-39 页。
6.1.3 Richard Rashid et al. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. IEEE Transactions on Computers 37, 8 (August 1988), pages 896–908. Originally published in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (November 1987), pages 31–39.
本文介绍了一种复杂的虚拟内存系统的设计,该系统已被多种操作系统采用,包括几种 BSD 操作系统和 Apple 的 OS X。该系统支持大型、稀疏的虚拟地址空间、页面的写时复制和内存映射文件。
This paper describes a design for a sophisticated virtual memory system that has been adopted by several operating systems, including several BSD operating systems and Apple’s OS X. The system supports large, sparse virtual address spaces, copy-on-write copying of pages, and memory-mapped files.
6.1.4Ted Kaehler 和 Glenn Krasner。LOOM:面向 Smalltalk-80 系统的大型面向对象内存。收录于 Glenn Krasner 编辑的《Smalltalk-80:历史片段,忠告》。Addison-Wesley,1983 年,第 251-271 页。ISBN:0-201-11669-3。
6.1.4 Ted Kaehler and Glenn Krasner. LOOM: Large object-oriented memory for Smalltalk-80 systems. In Glenn Krasner, editor, Smalltalk-80: Bits of History, Words of Advice. Addison-Wesley, 1983, pages 251–271. ISBN: 0–201–11669–3.
本文介绍了 Smalltalk(一种用于台式计算机的交互式编程系统)中使用的内存管理系统。连贯的虚拟内存语言支持系统提供了大量小对象,同时以集成方式考虑了地址空间分配、多级内存管理和命名。
This paper describes the memory-management system used in Smalltalk, an interactive programming system for desktop computers. A coherent virtual memory language support system provides for lots of small objects while taking into account address space allocation, multilevel memory management, and naming in an integrated way.
Swinehart 等人撰写的有关 Woodstock 文件系统的论文(阅读4.2.1)描述了一种采用多级内存管理系统组织的文件系统。另请参阅阅读 10.1.8,了解使用多级内存管理的一个有趣应用程序(共享虚拟内存)。
The paper on the Woodstock File System, by Swinehart et al., reading 4.2.1, describes a file system that is organized as a multilevel memory management system. Also see reading 10.1.8 for an interesting application (shared virtual memory) using multilevel memory management.
6.2.1Michael D. Schroeder 和 Michael Burrows。《Firefly RPC 的性能》。《ACM 计算机系统学报》第 8 卷,第 1 期(1990 年 2 月),第 1-17 页。最初发表于《第十二届 ACM 操作系统原理研讨会论文集》,载于《操作系统评论》第 23 卷,第 5 期(1989 年 12 月),第 102-113 页。
6.2.1 Michael D. Schroeder and Michael Burrows. Performance of Firefly RPC. ACM Transactions on Computer Systems 8, 1 (February 1990), pages 1–17. Originally published in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, in Operating Systems Review 23, 5 (December 1989), pages 102–113.
作为对阅读材料4.1.1中远程过程调用抽象讨论的补充,本文对特定实现中所需的步骤进行了具体的、详细的说明,然后将此说明与总体时间测量值进行比较。除了深入了解远程过程的固有成本外,这项工作还表明,可以进行自下而上的性能分析,这与自上而下的测量值有很好的相关性。
As a complement to the abstract discussion of remote procedure call in reading 4.1.1, this paper gives a concrete, blow-by-blow accounting of the steps required in a particular implementation and then compares this accounting with overall time measurements. In addition to providing insight into the intrinsic costs of remote procedures, this work demonstrates that it is possible to do bottom-up performance analysis that correlates well with top-down measurements.
6.2.2Brian N. Bershad、Thomas E. Anderson、Edward D. Lazowska 和 Henry M. Levy。轻量级远程过程调用。ACM Transactions on Computer Systems 8,1(1990 年 2 月),第 37-55 页。最初发表于第十二届 ACM 操作系统原理研讨会论文集,载于Operating Systems Review 23,5(1989 年 12 月),第 102-113 页。
6.2.2 Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight remote procedure call. ACM Transactions on Computer Systems 8, 1 (February 1990), pages 37–55. Originally published in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, in Operating Systems Review 23, 5 (December 1989), pages 102–113.
6.2.3Jochen Liedtke。通过内核设计改进 IPC。第十四届 ACM 操作系统原理研讨会论文集,载于《操作系统评论》第 27 卷,第 5 期(1993 年 12 月),第 175-187 页。
6.2.3 Jochen Liedtke. Improving IPC by kernel design. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 27, 5 (December 1993), pages 175–187.
这两篇论文开发的技术使得本地基于内核的客户端/服务模块化在应用程序设计人员看来就像远程客户端/服务模块化一样,同时还能捕捉到本地带来的性能优势。
These two papers develop techniques to allow local kernel-based client/service modularity to look just like remote client/service modularity to the application designer, while at the same time capturing the performance advantage that can come from being local.
6.3.1Chris Ruemmler 和 John Wilkes。磁盘驱动器建模简介。Computer 27,3(1994 年 3 月),第 17-28 页。
6.3.1 Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. Computer 27, 3 (March 1994), pages 17–28.
本文实际上是两篇论文合二为一。前五页非常通俗易懂地解释了磁盘驱动器和控制器的实际工作原理。本文的其余部分主要针对性能建模专家,探讨了准确模拟复杂磁盘驱动器的问题,并使用测量数据显示由各种建模简化(或过度简化)引起的误差大小。
This paper is really two papers in one. The first five pages provide a wonderfully accessible explanation of how disk drives and controllers actually work. The rest of the paper, of interest primarily to performance modeling specialists, explores the problem of accurately simulating a complex disk drive, with measurement data to show the size of errors that arise from various modeling simplifications (or oversimplifications).
6.3.2Marshall K. McKusick、William N. Joy、Samuel J. Leffler 和 Robert S. Fabry。UNIX 的快速文件系统。ACM Transactions on Computer Systems 2、3(1984 年 8 月),第 181-197 页。
6.3.2 Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems 2, 3 (August 1984), pages 181–197.
“快速文件系统”很好地展示了在最初设计为简单性的典范的文件系统中添加几种众所周知的性能增强技术(如多种块大小和基于邻接性的扇区分配)时性能和复杂性之间的权衡。
The “fast file system” nicely demonstrates the trade-offs between performance and complexity in adding several well-known performance enhancement techniques, such as multiple block sizes and sector allocation based on adjacency, to a file system that was originally designed as the epitome of simplicity.
6.3.3Gregory R. Ganger 和 Yale N. Patt。文件系统中的元数据更新性能。第一届 USENIX 操作系统设计和实现研讨会论文集(1994 年 11 月),第 49-60 页。
6.3.3 Gregory R. Ganger and Yale N. Patt. Metadata update performance in file systems. Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (November 1994), pages 49–60.
本文将最初为数据库系统开发的一些恢复和一致性概念应用于文件系统。它描述了一些简单的规则(例如,在写入其指向的磁盘块之后,应将 inode 写入磁盘),这些规则允许系统设计人员实现高性能的文件系统,并且在出现故障时始终保持其磁盘上数据结构的一致性。当应用程序执行文件操作时,这些规则会在写后缓存中的数据块之间创建依赖关系。了解这些依赖关系的磁盘驱动程序可以按顺序将缓存的块写入磁盘,从而即使系统崩溃也能保持磁盘上数据结构的一致性。
This paper is an application to file systems of some recovery and consistency concepts originally developed for database systems. It describes a few simple rules (e.g., an inode should be written to the disk after writing the disk blocks to which it points) that allow a system designer to implement a file system that is high performance and always keeps its on-disk data structures consistent in the presence of failures. As applications perform file operations, the rules create dependencies between data blocks in the write-behind cache. A disk driver that knows about these dependencies can write the cached blocks to disk in an order that maintains consistency of on-disk data structures despite system crashes.
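The ordering rule itself is easy to demonstrate. In this sketch (the block layout, inode format, and helper are invented for illustration), an ordinary file stands in for the disk and fsync() acts as the write barrier between the two steps:

    /* Sketch of the ordering rule "write the data block before the
     * inode that points to it".  An ordinary file stands in for the
     * disk, and fsync() is the write barrier. */
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLKSIZE 4096

    struct inode { int block[12]; };     /* hypothetical on-disk inode */

    int append_block(int disk, struct inode *ino, off_t ino_off,
                     int blkno, const char *data) {
        char buf[BLKSIZE] = {0};
        strncpy(buf, data, BLKSIZE - 1);

        /* step 1: write the new data block and wait until durable */
        if (pwrite(disk, buf, BLKSIZE, (off_t)blkno * BLKSIZE) < 0)
            return -1;
        if (fsync(disk) < 0)
            return -1;

        /* step 2: only now write the inode that points at it; a crash
         * between the steps leaves an unreferenced block, never an
         * inode pointing at garbage */
        ino->block[0] = blkno;
        if (pwrite(disk, ino, sizeof *ino, ino_off) < 0)
            return -1;
        return fsync(disk);
    }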
6.3.4Andrew Birrell 等人。高性能闪存盘的设计。ACM操作系统评论 41,2(2007 年 4 月),第 88-93 页。(也出现在 Microsoft Corporation 技术报告 TR-2005-176 中。)
6.3.4 Andrew Birrell et al. A design for high-performance flash disks. ACM Operating Systems Review 41, 2 (April 2007), pages 88–93. (Also appeared as Microsoft Corporation technical report TR-2005-176.)
闪存(非易失性)电子存储器以磁盘的形式组织起来,已成为一种更昂贵但延迟极低的磁盘替代品,可用于持久存储。这篇短文以通俗易懂的方式描述了使用闪存盘构建高性能文件系统所面临的挑战,并提出了一种应对这些挑战的设计。对于想要探索基于闪存的存储系统的读者来说,这篇论文是一个很好的开始。
Flash (non-volatile) electronic memory organized to appear as a disk has emerged as a more expensive but very low-latency alternative to magnetic disks for durable storage. This short paper describes, in an easy-to-understand way, the challenges associated with building a high-performance file system using flash disks and proposes a design to address the challenges. This paper is a good start for readers who want to explore flash-based storage systems.
6.4.1Sharon E. Perl 和 Richard L. Sites。使用动态执行跟踪研究 Windows NT 性能。第二届 USENIX 操作系统设计和实现研讨会论文集(1996 年 10 月)。另见《操作系统评论》第 30 卷,SI(1996 年 10 月),第 169–184 页。
6.4.1 Sharon E. Perl and Richard L. Sites. Studies of Windows NT performance using dynamic execution traces. Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (October 1996). Also in Operating System Review 30, SI (October 1996), pages 169–184.
本文通过实例说明,计算机系统中的任何性能问题都可以得到解释。作者创建了一个工具来收集 Windows NT 操作系统和应用程序执行的指令的完整踪迹。作者得出结论,引脚带宽限制了应用程序可实现的执行速度,而操作系统内部的锁可能会限制应用程序扩展到超过中等数量的处理器。本文还讨论了缓存一致性硬件(参见第 10 章 [在线])对应用程序性能的影响。所有这些问题对于单芯片上的多处理器来说都越来越重要。
This paper shows by example that any performance issue in computer systems can be explained. The authors created a tool to collect complete traces of instructions executed by the Windows NT operating system and applications. They conclude that pin bandwidth limits the achievable execution speed of applications and that locks inside the operating system can prevent applications from scaling beyond a moderate number of processors. The paper also discusses the impact of cache-coherence hardware (see Chapter 10 [on-line]) on application performance. All of these issues are increasingly important for multiprocessors on a single chip.
6.4.2 Jeffrey C. Mogul and K. K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems 15, 3 (August 1997), pages 217–252.
This paper introduces the problem of receive livelock (described in Sidebar 6.7) and presents a solution. Receive livelock is an undesirable condition that can arise when a system is temporarily overloaded: the server spends so much of its time saying “I’m too busy” that it has no time left to serve any of the requests.
6.4.3 Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Proceedings of the Sixth USENIX Symposium on Operating Systems Design and Implementation (December 2004), pages 137–150. Also in Communications of the ACM 51, 1 (January 2008), pages 1–10.
This paper is a case study of aggregating arrays of computers (reaching into the thousands) to perform parallel computations on large data sets (e.g., all the pages of the Web). It uses a model that applies when a composition of two serial functions (Map and Reduce) has no side-effects on the data sets. The charm of MapReduce is that for computations that fit the model, the runtime uses concurrency but hides it completely from the programmer. The runtime partitions the input data set, executes the functions in parallel on different parts of the data set, and handles the failures of individual computers.
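As a toy illustration of the model (not Google's implementation), the following Python runs a word count through user-supplied Map and Reduce functions on a single machine; the real runtime partitions the input and runs these same calls across thousands of computers.

from itertools import groupby

def map_fn(document):                 # serial, side-effect-free Map
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):          # serial, side-effect-free Reduce
    return (word, sum(counts))

def map_reduce(documents):
    pairs = sorted(kv for doc in documents for kv in map_fn(doc))
    return [reduce_fn(word, [c for _, c in group])
            for word, group in groupby(pairs, key=lambda kv: kv[0])]

print(map_reduce(["the web the web", "pages of the web"]))
# [('of', 1), ('pages', 1), ('the', 3), ('web', 3)]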
Proceedings of the IEEE 66, 11 (November 1978), a special issue of that journal devoted to packet switching, contains several papers mentioned under various topics here. Collectively, they provide an extensive early bibliography on computer communications.
The book by Perlman on bridges and routers, reading 1.2.5, explains how the network layer really works.
7.1.1 David D. Clark, Kenneth T. Pogran, and David P. Reed. An introduction to local area networks. Proceedings of the IEEE 66, 11 (November 1978), pages 1497–1517.
This basic tutorial on local area network communications characterizes the various modular components of a local area network, both interface and protocols, gives specific examples, and explains how local area networks relate to larger, interconnected networks. The specific examples are now out of date, but the rest of the material is timeless.
7.1.2 Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM 19, 7 (July 1976), pages 395–404.
This paper provides the design of what has proven to be the most popular local area network technology.
7.2.1 Louis Pouzin and Hubert Zimmerman. A tutorial on protocols. Proceedings of the IEEE 66, 11 (November 1978), pages 1346–1370.
This paper is well written and provides perspective along with the details. The fact that it was written a long time ago turns out to be its major appeal. Because networks were not widely understood at the time, it was necessary to fully explain all of the assumptions and offer extensive analogies. This paper does an excellent job of both, and as a consequence it provides a useful complement to modern texts. While reading this paper, anyone familiar with current network technology will frequently exclaim, “So that’s why the Internet works that way.”
7.2.2 Vinton G. Cerf and Peter T. Kirstein. Issues in packet-network interconnection. Proceedings of the IEEE 66, 11 (November 1978), pages 1386–1408.
At the time this paper was written, an emerging problem was the interconnection of independently administered data communication networks. This paper explores the issues in both breadth and depth, a combination that more recent papers do not provide.
7.2.3 David D. Clark and David L. Tennenhouse. Architectural considerations for a new generation of protocols. ACM SIGCOMM ’90 Conference: Communications Architectures and Protocols, in Computer Communication Review 20, 4 (September 1990), pages 200–208.
This paper captures 20 years of experience in protocol design and implementation and lays out the requirements for the next few rounds of protocol design. The basic observation is that the performance requirements of future high-speed networks and applications will require that the layers used for protocol description not constrain implementations to be similarly layered. This paper is required reading for anyone who is developing a new protocol or protocol suite.
7.2.4 Danny Cohen. On holy wars and a plea for peace. IEEE Computer 14, 10 (October 1981), pages 48–54.
This is an entertaining discussion of big-endian and little-endian arguments in protocol design.
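For readers who have not yet met the two byte orders, this pair of Python one-liners shows the disagreement the paper's "holy war" is about:

import struct

value = 0x0A0B0C0D
print(struct.pack('>I', value).hex())   # big-endian wire order:    0a0b0c0d
print(struct.pack('<I', value).hex())   # little-endian wire order: 0d0c0b0a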
7.2.5 Danny Cohen. Flow control for real-time communication. Computer Communication Review 10, 1–2 (January/April 1980), pages 41–47.
This brief item is the source of the “servant’s dilemma”, a parable that provides helpful insight into why flow control decisions must involve the application.
7.2.6 Geoff Huston. Anatomy: A look inside network address translators. The Internet Protocol Journal 7, 3 (September 2004), pages 2–32.
Network address translators (NATs) break down the universal connectivity property of the Internet: when NATs are in use, one can no longer assume that every computer in the Internet can communicate with every other computer in the Internet. This paper discusses the motivation for NATs, how they work, and in what ways they create havoc for some Internet applications.
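A minimal sketch in Python of the translation table at the heart of a NAT may help. The names and addresses are illustrative only; real NATs also rewrite checksums, expire mappings, and handle ICMP.

class Nat:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000
        self.table = {}              # public port -> (private ip, private port)

    def outbound(self, private_ip, private_port):
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (private_ip, private_port)
        return self.public_ip, public_port   # source address the Internet sees

    def inbound(self, public_port):
        # Packets to an unmapped port are dropped: this is why a host behind
        # a NAT cannot be reached until it has initiated a connection.
        return self.table.get(public_port)

nat = Nat("203.0.113.5")
print(nat.outbound("10.0.0.7", 5000))   # ('203.0.113.5', 40000)
print(nat.inbound(40000))               # ('10.0.0.7', 5000)
print(nat.inbound(40001))               # None: no mapping, packet dropped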
7.2.7 Van Jacobson. Congestion avoidance and control. Proceedings of the Symposium on Communications Architectures and Protocols (SIGCOMM ’88), pages 314–329. Also in Computer Communication Review 18, 4 (August 1988).
Sidebar 7.9 gives a simplified description of the congestion avoidance and control mechanisms of TCP, the most commonly used transport protocol in the Internet. This paper explains those mechanisms in full detail. They are surprisingly simple but have proven to be effective.
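The core additive-increase/multiplicative-decrease rule can be stated in a few lines of Python. This is a simplification: slow start, timeout estimation, and the paper's other refinements are omitted.

def next_window(cwnd, loss_detected, mss=1):
    if loss_detected:
        return max(mss, cwnd / 2)   # multiplicative decrease on congestion
    return cwnd + mss               # additive increase per round trip

cwnd = 1.0
for loss in [False, False, False, True, False]:
    cwnd = next_window(cwnd, loss)
    print(cwnd)                     # 2.0, 3.0, 4.0, 2.0, 3.0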
7.2.8 Jordan Ritter. Why Gnutella can’t scale. No, really. Unpublished grey literature. <http://www.darkridge.com/~jpr5/doc/gnutella.html>
This paper offers a simple performance model to explain why the Gnutella protocol (see problem set 20) cannot support large networks of Gnutella peers. The problem is incommensurate scaling of its bandwidth requirements.
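The incommensurate scaling is easy to reproduce with a back-of-the-envelope calculation: if each peer forwards a query to all of its other neighbors, the per-query message count grows geometrically with the time-to-live. The numbers below are illustrative, not the paper's.

def flood_messages(fanout, ttl):
    # hop 1 reaches 'fanout' neighbors; each then forwards to its other fanout-1
    return sum(fanout * (fanout - 1) ** (hop - 1) for hop in range(1, ttl + 1))

print(flood_messages(4, 7))   # 4,372 messages for one query
print(flood_messages(8, 7))   # 1,098,056: doubling the fanout is catastrophic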
7.2.9 David B. Johnson. Scalable support for transparent mobile host internetworking. Wireless Networks 1, 3 (1995), pages 311–321.
Addressing a laptop computer that is connected to a network by a radio link and that can move from place to place without disrupting network connections can be a challenge. This paper proposes a systematic approach based on maintaining a tunnel between the laptop computer’s current location and an agent located at its usual home location. Variations of this paper (based on the author’s 1993 Ph.D. Thesis at Carnegie-Mellon University and available as CMU Computer Science Technical Report CS–93–128) have appeared in several 1993 and 1994 workshops and conferences, as well as in the book Mobile Computing, Tomasz Imielinski and Henry F. Korth, editors, Kluwer Academic Publishers, c. 1996. ISBN: 079239697-9.
One popular protocol, remote procedure call, is covered in depth in reading 4.1.1 by Birrell and Nelson, as well as Section 10.3 of Tanenbaum’s Modern Operating Systems, reading 1.2.1.
7.3.1 Leonard Kleinrock. Principles and lessons in packet communications. Proceedings of the IEEE 66, 11 (November 1978), pages 1320–1329.
7.3.2 Lawrence G. Roberts. The evolution of packet switching. Proceedings of the IEEE 66, 11 (November 1978), pages 1307–1313.
These two papers discuss experience with the ARPANET. Anyone faced with the need to design a network should look over these two papers, which focus on lessons learned and the sources of surprise.
7.3.3 J[erome] H. Saltzer, D[avid] P. Reed, and D[avid] D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (November 1984), pages 277–288. An earlier version appears in the Proceedings of the Second International Conference on Distributed Computing Systems (April 1981), pages 504–512.
This paper proposes a design rationale for deciding which functions belong in which layers of a layered network implementation. It is one of the few papers available that provides a system design principle.
7.3.4 Leonard Kleinrock. The latency/bandwidth trade-off in gigabit networks. IEEE Communications Magazine 30, 4 (April 1992), pages 36–40.
Technology has made gigabit/second data rates economically feasible over long distances. But long distances and high data rates conspire to change some fundamental properties of a packet network—latency becomes the dominant factor that limits applications. This paper provides a very good explanation of the problem.
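The problem is easy to quantify. With illustrative numbers (a 60-millisecond round trip and a 1 gigabit/second link, not figures from the paper):

bandwidth = 1e9                      # bits/second
round_trip = 0.06                    # seconds; illustrative long-haul figure
bits_in_flight = bandwidth * round_trip
print(bits_in_flight / 8 / 1e6)      # 7.5 megabytes must be in the pipe to keep it full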
For the complete word on the Internet protocols, check out the following series of books.
7.4.1 W. Richard Stevens. TCP/IP Illustrated. Addison-Wesley; v. 1, 1994, ISBN: 0–201–63346–9, 576 pages; v. 2 (with co-author Gary R. Wright), 1995, ISBN: 0–201–63354–X, 1174 pages; v. 3, 1996, ISBN: 0–201–63495–3, 328 pages. Volume 1: The Protocols. Volume 2: The Implementation. Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX® Domain Protocols.
These three volumes will tell you more than you wanted to know about how TCP/IP is implemented, using the network implementation of the Berkeley System Distribution for reference. The word “illustrated” refers more to computer printouts—listings of packet traces and programs—than to diagrams. If you want to know how some aspect of the Internet protocol suite is actually implemented, this is the place to look—though it does not often explain why particular implementation choices were made.
A plan for some degree of fault tolerance shows up in many systems. For an example of fault tolerance in distributed file systems, see the paper on Coda by Kistler and Satyanarayanan, reading 10.1.2. See also the paper on RAID by Katz et al., reading 10.2.2.
Chapter 3 of the book by Gray and Reuter, reading 1.1.5, provides a bedrock text on this subject.
8.1.1 Jim [N.] Gray and Daniel P. Siewiorek. High-availability computer systems. Computer 24, 9 (September 1991), pages 39–48.
This is a very nice, easy-to-read overview of how high availability can be achieved.
8.1.2 Daniel P. Siewiorek. Architecture of fault-tolerant computers. Computer 17, 8 (August 1984), pages 9–18.
This paper provides an excellent taxonomy, as well as a good overview of several architectural approaches to designing computers that continue running even when a single hardware component fails.
8.2.1 Dawson Engler et al. Bugs as deviant behavior: A general approach to inferring errors in systems code. Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, 2001, in Operating Systems Review 35, 5 (December 2001), pages 57–72.
This paper describes a method for finding possible programming faults in large systems by looking for inconsistencies. For example, if in most cases an invocation of a certain function is preceded by disabling interrupts but in a few cases it is not, there is a good chance that a programming fault is present. The paper uses this insight to create a tool for finding potential faults in large systems.
8.2.2 Michael M. Swift et al. Recovering device drivers. Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (December 2004), pages 1–16.
This paper observes that software faults in device drivers often lead to fatal errors that cause operating systems to fail and thus require a reboot. It then describes how virtual memory techniques can be used to enforce modularity between device drivers and the rest of the operating system kernel, and how the operating system can recover device drivers when they fail, reducing the number of reboots.
8.3.1 Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Proceedings of the Fifth USENIX Conference on File and Storage Technologies (2007), pages 1–16.
As explained in Section 8.2, it is not uncommon that data sheets for disk drives specify MTTFs of one hundred years or more, many times the actual observed lifetimes of those drives in the field. This paper looks at disk replacement data for 100,000 disk drives and discusses what MTTF means for those disk drives.
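Part of the answer is simple arithmetic: an MTTF is a statement about failure rates across a large population, not about the lifetime of any one drive. Taking the figure in the title at face value:

mttf_hours = 1_000_000
hours_per_year = 8_760
print(mttf_hours / hours_per_year)           # about 114 drive-years per failure
print(hours_per_year / mttf_hours * 100)     # about a 0.88% annualized failure rate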
8.3.2 Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. Failure trends in a large disk drive population. Proceedings of the Fifth USENIX Conference on File and Storage Technologies (2007), pages 17–28.
Recently, outfits such as Google have deployed large enough numbers of off-the-shelf disk drives for a long enough time that they can make their own evaluations of disk drive failure rates and lifetimes, for comparison with the a priori reliability models of the disk vendors. This paper reports data collected from such observations. It analyzes the correlation between failures and several parameters that are generally believed to impact the lifetime of disks, and it finds some surprises. For example, it reports that temperature is less correlated with disk drive failure than was previously reported, as long as the temperature is within a certain range and stable.
The best source on this topic is reading 1.1.5, but Gray and Reuter’s thousand-page book can be a bit overwhelming.
9.1.1 Warren A. Montgomery. Robust Concurrency Control for a Distributed Information System. Ph.D. Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, December 1978. Also available as M.I.T. Laboratory for Computer Science Technical Report TR-207, January 1979. 197 pages.
This work describes alternative strategies that maximize concurrent activity while achieving atomicity: maintaining multiple values for some variables, and atomic broadcast of messages to achieve proper sequencing.
9.1.2 D. B. Lomet. Process structuring, synchronization, and recovery using atomic actions. Proceedings of an ACM Conference on Language Design for Reliable Software (March 1977), pages 128–137. Published as ACM SIGPLAN Notices 12, 3 (March 1977); Operating Systems Review 11, 2 (April 1977); and Software Engineering Notes 2, 2 (March 1977).
This is one of the first attempts to link atomicity to both recovery and coordination. It is written from a language, rather than an implementation, perspective.
9.2.1 Jim [N.] Gray et al. The recovery manager of the System R database manager. ACM Computing Surveys 13, 2 (June 1981), pages 223–242.
This paper is a case study of a sophisticated, real, high-performance logging and locking system. It is one of the most interesting case studies of its type because it shows the number of different, interacting mechanisms needed to construct a system that performs well.
9.2.2 C. Mohan et al. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems 17, 1 (1992), pages 94–162.
This paper describes all the intricate design details of a fully featured, commercial-quality database transaction system that uses write-ahead logging.
9.2.3 C. Mohan, Bruce Lindsay, and Ron Obermarck. Transaction management in the R* distributed database management system. ACM Transactions on Database Systems (TODS) 11, 4 (December 1986), pages 378–396.
This paper deals with transaction management for distributed databases, and introduces two new protocols (Presumed Abort and Presumed Commit) that optimize two-phase commit (see Section 9.6), resulting in fewer messages and log writes. Presumed Abort is optimized for transactions that perform only read operations, and Presumed Commit is optimized for transactions with updates that involve several distributed databases.
9.2.4 Tom Barclay, Jim Gray, and Don Slutz. Microsoft TerraServer: A spatial data warehouse. Microsoft Technical Report MS-TR-99–29. June 1999.
The authors report on building a popular Web site that hosts aerial, satellite, and topographic images of Earth using off-the-shelf components, including a standard database system for storing the terabytes of data.
9.2.5 Ben Vandiver et al. Tolerating Byzantine faults in transaction processing systems using commit barrier scheduling. Proceedings of the Twenty-first ACM Symposium on Operating Systems Principles, in Operating Systems Review 41, 6 (December 2007), pages 59–79.
This paper describes a replication scheme for handling Byzantine faults in database systems. It issues queries and updates to multiple replicas of unmodified, off-the-shelf database systems, and it compares their responses, thus creating a single database that is Byzantine fault tolerant (see Section 8.6 for the definition of Byzantine).
9.3.1 Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (February 1992), pages 26–52. Originally published in Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December 1991), pages 1–15.
Although it has long been suggested that one could in principle store the contents of a file system on disk in the form of a finite log, this design is one of the few that demonstrates the full implications of that design strategy. The paper also presents a fine example of how to approach a system problem by carefully defining the objective, measuring previous systems to obtain a benchmark, and then comparing performance as well as functional aspects that cannot be measured.
9.3.2 H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems 6, 2 (June 1981), pages 213–226.
This early paper introduced the idea of using optimistic approaches to controlling updates to shared data. An optimistic scheme is one in which a transaction proceeds in the hope that its updates do not conflict with the concurrent updates of another transaction. At commit time, the transaction checks whether that hope was justified. If so, the transaction commits. If not, the transaction aborts and tries again. Applications that use a database in which contention for particular records is infrequent may run more efficiently with this optimistic scheme than with a scheme that always acquires locks to coordinate updates.
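A minimal sketch of the idea in Python follows. The names are hypothetical, and the paper's validation algorithm is considerably more careful about concurrent validators; this only conveys the proceed-then-validate shape.

class OptimisticTransaction:
    def __init__(self, store):
        self.store = store             # key -> (value, version)
        self.read_versions = {}        # key -> version observed
        self.writes = {}               # key -> new value

    def read(self, key):
        value, version = self.store[key]
        self.read_versions[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation: abort if anything we read has since changed.
        for key, version in self.read_versions.items():
            if self.store[key][1] != version:
                return False                       # abort; the caller retries
        for key, value in self.writes.items():
            _, version = self.store.get(key, (None, 0))
            self.store[key] = (value, version + 1)
        return True

store = {"x": (10, 0)}
t = OptimisticTransaction(store)
t.write("x", t.read("x") + 1)
print(t.commit(), store)   # True {'x': (11, 1)}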
See also the paper by Lampson and Sturgis, reading 1.8.7, and the paper by Ganger and Patt, reading 6.3.3.
10.1.1 J. R. Goodman. Using cache memory to reduce processor-memory traffic. Proceedings of the 10th Annual International Symposium on Computer Architecture (1983), pages 124–132.
This is the paper that introduced a protocol for cache-coherent shared memory using snoopy caches. It also sparked much research on more scalable designs for cache-coherent shared memory.
10.1.2 James J. Kistler and M[ahadev] Satyanarayanan. Disconnected operation in the Coda file system. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 25, 5 (December 1991), pages 213–225.
Coda is a variation of the Andrew File System (AFS) that provides extra fault tolerance features. It is notable for using the same underlying mechanism to deal both with accidental disconnection due to network partition and the intentional disconnection associated with portable computers. This paper is very well written.
10.1.3 Jim Gray et al. The dangers of replication and a solution. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, in ACM SIGMOD Record 25, 2 (June 1996), pages 173–182.
This paper describes the challenges for replication protocols in situations where the replicas are stored on mobile computers that are frequently disconnected. The paper argues that trying to provide transactional semantics for an optimistic replication protocol in this setting is unstable because there will be too many reconciliation conflicts. It proposes a new two-tier protocol for reconciling disconnected replicas that addresses this problem.
10.1.4 Leslie Lamport. Paxos made simple. Distributed computing (column), ACM SIGACT News 32, 4 (Whole Number 121, December 2001), pages 51–58.
This paper describes an intricate protocol, Paxos, in a simple way. The Paxos protocol allows several computers to agree on a value (e.g., the list of available computers in a replicated service) in the face of network and computer failures. It is an important building block for constructing fault-tolerant services.
10.1.5 Fred Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22, 4 (1990), pages 299–319.
This paper provides a clear description of one of the most popular approaches for building fault-tolerant services, the replicated-state-machine approach.
10.1.6 Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (1978), pages 558–565.
This paper introduces an idea that is now known as Lamport clocks. A Lamport clock provides a global, logical clock for a distributed system that respects the physical clocks of the computers comprising the distributed system and the communication between them. The paper also introduces the idea of replicated state machines.
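The clock rules themselves fit in a few lines of Python: tick on every local event, and on receipt adopt the maximum of the local clock and the message's timestamp, plus one.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        return self.local_event()          # timestamp carried by the message

    def receive(self, message_time):
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()          # a.time is now 1
print(b.receive(t))   # 2: the receive is ordered after the send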
10.1.7 David K. Gifford. Weighted voting for replicated data. Proceedings of the Seventh ACM Symposium on Operating Systems Principles, in Operating Systems Review 13, 5 (December 1979), pages 150–162. Also available as Xerox Palo Alto Research Center Technical Report CSL–79–14 (September 1979).
The work discusses a replicated data algorithm that allows the trade-off between reliability and performance to be adjusted by assigning weights to each data copy and requiring transactions to collect a quorum of those weights before reading or writing.
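The quorum arithmetic is the heart of the scheme. A sketch with illustrative weights (not an example from the paper):

weights = {"replica_a": 1, "replica_b": 2, "replica_c": 2}   # assigned weights
total = sum(weights.values())        # 5 votes in all
r, w = 2, 4                          # read and write quorums
assert r + w > total                 # every read quorum overlaps every write quorum
assert 2 * w > total                 # any two write quorums overlap, serializing writes
print("quorum sizes are safe")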
10.1.8 Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems 7, 4 (November 1989), pages 321–359.
This paper describes a method to create a shared virtual memory across several separated computers that can communicate only with messages. It uses hardware support for virtual memory to cause the results of a write to a page to be observed by readers of that page on other computers. The goal is to allow programmers to write parallel applications on a distributed computer system in shared-memory style instead of a message-passing style.
10.1.9 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (October 2003), pages 29–43. Also in Operating Systems Review 37, 5 (December 2003).
This paper introduces a file system used in many of Google’s applications. It aggregates the disks of thousands of computers in a cluster into a single storage system with a simple file system interface. Its design is optimized for large files and replicates files for fault tolerance. The Google File System is used in the storage back-end of many of Google’s applications, including search.
10.1.10 F[ay] Chang et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26, 2, Article 4 (2008), pages 1–26.
This paper describes a database-like system for storing petabytes of structured data on thousands of commodity servers.
10.2.1 Raymond A. Lorie. The long-term preservation of digital information. Proceedings of the First ACM/IEEE Joint Conference on Digital Libraries (2001), pages 346–352.
This is a thoughtful discussion of the problems of archiving digital information despite medium and technology obsolescence.
10.2.2 Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk system architectures for high performance computing. Proceedings of the IEEE 77, 12 (December 1989), pages 1842–1857.
The first part of this reference paper on Redundant Arrays of Independent Disks (RAID) reviews disk technology; the important material is the catalog of six varieties of RAID organization.
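The parity trick that several of the RAID organizations rely on is ordinary exclusive OR, as this small Python fragment shows:

d1, d2, d3 = 0b1010, 0b0110, 0b1100    # data blocks (a few bits each, for brevity)
parity = d1 ^ d2 ^ d3                  # stored on the parity disk
recovered = d1 ^ d3 ^ parity           # suppose the disk holding d2 fails
assert recovered == d2
print(bin(recovered))                  # 0b110: the lost block, reconstructed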
10.2.3 Petros Maniatis et al. LOCKSS: A peer-to-peer digital preservation system. ACM Transactions on Computer Systems 23, 1 (February 2005), pages 2–50.
This paper describes a peer-to-peer system for preserving access to journals and other archival information published on the Web. Its design is based on the mantra “lots of copies keep stuff safe” (LOCKSS). A large number of persistent Web caches keep copies and cooperate to detect and repair damage to their copies using a new voting scheme.
10.2.4 A[lan J.] Demers et al. Epidemic algorithms for replicated database maintenance. Proceedings of the Sixth Symposium on Principles of Distributed Computing (August 1987), pages 1–12. Also in Operating Systems Review 22, 1 (January 1988), pages 8–12.
This paper describes an epidemic protocol to update data that is replicated on many machines. The essence of an epidemic protocol is that each computer periodically gossips with some other, randomly chosen computer and exchanges information; multiple computers thus learn about all updates in a viral fashion. Epidemic protocols can be simple and robust, yet can spread updates relatively quickly.
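A minimal push-style gossip simulation in Python conveys the flavor; the paper analyzes several richer variants (anti-entropy, rumor mongering), and this toy is not any of them exactly.

import random

def gossip_round(knows, machines):
    # every computer that knows the update tells one randomly chosen peer
    for m in [m for m in machines if knows[m]]:
        knows[random.choice(machines)] = True

machines = list(range(100))
knows = {m: False for m in machines}
knows[0] = True                    # one machine starts with the update
rounds = 0
while not all(knows.values()):
    gossip_round(knows, machines)
    rounds += 1
print(rounds)   # typically around 10 rounds for 100 machines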
10.3.1 Douglas B. Terry et al. Managing update conflicts in Bayou, a weakly connected replicated storage system. Proceedings of the 15th Symposium on Operating Systems Principles (December 1995), in Operating Systems Review 29, 5 (December 1995), pages 172–183.
This paper introduces a replication scheme for computers that share data but are not always connected. For example, each computer may have a copy of a calendar, which it can update optimistically. Bayou will propagate these updates, detect conflicts, and attempt to resolve conflicts, if possible.
10.3.2 Trevor Jim, Benjamin C. Pierce, and Jérôme Vouillon. How to build a file synchronizer. (A widely circulated piece of grey literature—dated February 22, 2002 but never published.)
This paper describes the nuts and bolts of Unison, a tool that efficiently synchronizes the files stored on two computers. Unison is targeted to users who have their files stored in several places (e.g., on a server at work, a laptop to carry while traveling, and a desktop at home) and would like to have all the files on the different computers be the same.
The fundamental book about privacy is reading 1.1.6 by Alan Westin.
11.1.1 Arthur R. Miller. The Assault on Privacy. University of Michigan Press, Ann Arbor, Michigan, 1971. ISBN: 0–47265500–0. 333 pages. (Out of print.)
This book articulately spells out the potential effect of computerized data-gathering systems on privacy, and possible approaches to improving legal protection. Part of the latter is now out of date because of advances in legislation, but most of this book is still of much interest.
11.1.2 Daniel J. Weitzner et al. Information accountability. Communications of the ACM 51, 6 (June 2008), pages 82–87.
The paper suggests that in the modern world Westin’s definition covers only a subset of privacy. See Sidebar 11.1 for a discussion of the paper’s proposed extended definition.
11.2.1 Jerome H. Saltzer and Michael D. Schroeder. The protection of information in computer systems. Proceedings of the IEEE 63, 9 (September 1975), pages 1278–1308.
After 30 years, this paper (an early version of the current Chapter 11) still provides an effective treatment of protection mechanics in multiuser systems. Its emphasis on protection inside a single system, rather than between systems connected to a network, is one of its chief shortcomings, along with antique examples and omission of newer techniques of certification such as authentication logic.
11.2.2 R[oger] M. Needham. Protection systems and protection implementations. AFIPS Fall Joint Conference 41, Part I (December 1972), pages 571–578.
This paper is probably as clear an explanation of capability systems as one is likely to find. For another important paper on capabilities, see Fabry, reading 3.1.2.
11.3.1 Butler [W.] Lampson, Martín Abadi, Michael Burrows, and Edward Wobber. Authentication in distributed systems: Theory and practice. ACM Transactions on Computer Systems 10, 4 (November 1992), pages 265–310.
This paper, one of a series on a logic that can be used to reason systematically about authentication, provides a relatively complete explication of the theory and shows how to apply it to the protocols of a distributed system.
11.3.2 Edward Wobber, Martín Abadi, Michael Burrows, and Butler W. Lampson. Authentication in the Taos operating system. Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, in Operating Systems Review 27, 5 (December 1993), pages 256–269.
This paper applies the authentication logic developed in reading 11.3.1 to an experimental operating system. In addition to providing a concrete example, the explanation of the authentication logic itself is a little more accessible than that in the other paper.
11.3.3 Ken L. Thompson. Reflections on trusting trust. Communications of the ACM 27, 8 (August 1984), pages 761–763.
Anyone seriously interested in developing trusted computer systems should think hard about the implications for verification that this paper raises. Thompson demonstrates the ease with which a compiler expert can insert undetectable Trojan horses into a system. Reading 11.3.4 describes a way to detect such a Trojan horse. [The original idea that Thompson describes came from a paper whose identity he could not recall at the time; his article includes a footnote asking for help in locating it. The paper was a technical report of the United States Air Force Electronic Systems Division at Hanscom Air Force Base: Paul A. Karger and Roger R. Schell. Multics Security Evaluation: Vulnerability Analysis. ESD–TR–74–193, Volume II (June 1974), page 52.]
11.3.4 David A. Wheeler. Countering trusting trust through diverse double-compiling. Proceedings of the 21st Annual Computer Security Applications Conference (2005), pages 28–40.
This paper proposes a solution, which the author calls “diverse double compiling”, to detect the attack discussed in Thompson’s paper on trusting trust (see reading 11.3.3). The idea is to compile a new, untrusted compiler’s source code twice: first with a trusted compiler, and then with the result of that first compilation. If the resulting binary is bit-for-bit identical with the untrusted compiler’s original binary, then the source code accurately represents the untrusted binary, which is the first step in developing trust in the new compiler.
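In outline, and with a compile(compiler, source) function assumed rather than shown, the check is a bit-for-bit comparison. This is a sketch of the shape of the procedure, not Wheeler's full method, which also addresses compiler nondeterminism.

def diverse_double_compile(compile, trusted_compiler, suspect_binary, suspect_source):
    # Stage 1: build the suspect compiler's source with a trusted compiler.
    stage1 = compile(trusted_compiler, suspect_source)
    # Stage 2: rebuild the same source with the stage-1 output.
    stage2 = compile(stage1, suspect_source)
    # If stage 2 matches the suspect binary exactly, the source faithfully
    # represents that binary: no hidden self-reproducing Trojan horse.
    return stage2 == suspect_binary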
11.3.5 Paul A. Karger et al. A VMM security kernel for the VAX architecture. 1990 IEEE Computer Society Symposium on Security and Privacy (May 1990), pages 2–19.
In the 1970s, the U.S. Department of Defense undertook a research effort to create trusted computer systems for defense purposes and in the process created a large body of literature on the subject. This paper distills most of the relevant ideas from that literature into a single, readable case study, and it also provides pointers to other key papers for those seeking more details on these ideas.
11.3.6 David D. Clark and David R. Wilson. A comparison of commercial and military computer security policies. 1987 IEEE Symposium on Security and Privacy (April 1987), pages 184–194.
This thought-provoking paper outlines the requirements for security policy in commercial settings and argues that the lattice model is often not applicable. It suggests that these applications require a more object-oriented model in which data may be modified only by trusted programs.
11.3.7 Jaap-Henk Hoepman and Bart Jacobs. Increased security through open source. Communications of the ACM 50, 1 (January 2007), pages 79–83.
It has long been argued that the open design principle (see Section 11.1.4) is important to designing secure systems. This paper extends that argument by making the case that the availability of source code for a system is important in ensuring the security of its implementation.
See also reading 1.3.15 by Garfinkel and Spafford, reading 5.2.1 by Lampson and Sturgis, and reading 5.2.2 by Schroeder, Clark, and Saltzer.
11.4.1 Robert [H.] Morris and Ken [L.] Thompson. Password security: A case history. Communications of the ACM 22, 11 (November 1979), pages 594–597.
This paper is a model of how to explain something in an accessible way. With a minimum of jargon and an historical development designed to simplify things for the reader, it describes the UNIX password security mechanism.
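The scheme's essentials survive in modern systems: store a per-user random salt and a deliberately slow one-way hash of the salted password, never the password itself. A sketch using today's primitives rather than the paper's DES-based ones:

import hashlib, os

def store_password(password):
    salt = os.urandom(16)                       # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest                         # what the password file holds

def check_password(password, salt, digest):
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == digest

salt, digest = store_password("correct horse")
print(check_password("correct horse", salt, digest))   # True
print(check_password("wrong guess", salt, digest))     # False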
11.4.2 Frank Stajano and Ross J. Anderson. The resurrecting duckling: Security issues for ad-hoc wireless networks. Security Protocols Workshop 1999, pages 172–194.
This paper discusses the problem of how a new device (e.g., a surveillance camera) can establish a secure relationship with the remote controller of the device’s owner, instead of its neighbor’s or an adversary’s. The paper’s solution is that a device will recognize as its owner the first principal that sends it an authentication key. As soon as the device receives a key, its status changes from newborn to imprinted, and it stays faithful to that key until its death. The paper illustrates the problem and solution using a vivid analogy of how ducklings authenticate their mother (see Sidebar 11.5).
11.4.3 David Mazières. Self-certifying File System. Ph.D. Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (May 2000).
This thesis proposes a design for a cross-administrative domain file system that separates the file system from the security mechanism using an idea called self-certifying path names. Self-certifying names can be found in several other systems.
See also Sidebar 11.6 on Kerberos and reading 3.2.5, which uses cryptographic techniques to secure a personal naming system.
The fundamental books about cryptography applied to computer systems are reading 1.2.4, by Bruce Schneier, and reading 1.3.13, by Alfred Menezes et al. In light of these two books, the first few papers from the 1970’s listed below are primarily of historical interest. There is also a good, more elementary, treatment of cryptography in the book by Simson Garfinkel, reading 1.3.15. Note that all of these books and papers focus on the application of cryptography, not on crypto-mathematics, which is a distinct area of specialization not covered in this reading list. An accessible crypto-mathematics reference is reading 1.3.14.
11.5.1 R[onald] L. Rivest, A[di] Shamir, and L[en] Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21, 2 (February 1978), pages 120–126.
This paper was the first to suggest a possibly workable public key system.
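The arithmetic of the scheme can be demonstrated with textbook-sized numbers, far too small to be secure: choose primes p and q, form n = p*q, pick e, and compute d so that e*d = 1 modulo (p-1)(q-1); then m^e mod n encrypts and c^d mod n decrypts.

p, q = 61, 53
n = p * q                      # 3233, the public modulus
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent
d = pow(e, -1, phi)            # 2753, the private exponent (Python 3.8+)
message = 65
ciphertext = pow(message, e, n)    # 2790
print(pow(ciphertext, d, n))       # 65: the original message recovered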
11.5.2 Whitfield Diffie and Martin E. Hellman. Exhaustive cryptanalysis of the NBS Data Encryption Standard. Computer 10, 6 (June 1977), pages 74–84.
This is the unofficial analysis of how to break the DES by brute force—by building special-purpose chips and arraying them in parallel. Twenty-five years later, brute force still seems to be the only promising attack on DES, but the intervening improvements in hardware technology make special chips unnecessary—an array of personal computers on the Internet can do the job. The Advanced Encryption Standard (AES) is DES’s successor (see Section 11.9.3.1).
11.5.3 Ross J. Anderson. Why cryptosystems fail. Communications of the ACM 37, 11 (November 1994), pages 32–40.
Anderson presents a very nice analysis of what goes wrong in real-world cryptosystems—secure modules don’t necessarily lead to secure systems—and the applicability of systems thinking in their design. He points out that merely doing the best possible design isn’t enough; a feedback loop that corrects errors in the design following experience in the field is an equally important component that is sometimes forgotten.
11.5.4 David Wagner and Bruce Schneier. Analysis of the SSL 3.0 protocol. Proceedings of the Second USENIX Workshop on Electronic Commerce, Volume 2 (November 1996), pages 29–40.
This paper is useful not only because it provides a careful analysis of the security of the subject protocol, but it also explains how the protocol works in a form that is more accessible than the protocol specification documents. The originally published version was almost immediately revised with corrections. The revised version is available on the World Wide Web at <http://www.counterpane.com/ssl.html>.
11.5.5 M[ihir] Bellare, R[an] Canetti, and H[ugo] Krawczyk. Keying hash functions for message authentication. Proceedings of the 16th Annual International Cryptology Conference (August 1996), pages 1–15. (Also see H. Krawczyk, M. Bellare, and R. Canetti. HMAC: Keyed-hashing for message authentication. Request For Comments RFC 2104, Internet Engineering Task Force, February 1997.)
This paper and the RFC introduce and define HMAC, a hash function used in widely deployed protocols.
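Python's standard library implements the construction directly:

import hmac, hashlib

tag = hmac.new(b"shared secret key", b"the message", hashlib.sha256).hexdigest()
print(tag)   # the receiver recomputes the tag and checks it with hmac.compare_digest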
11.5.6 David Chaum. Untraceable electronic mail, return addresses, and digital pseudonyms. Communications of the ACM 24, 2 (February 1981), pages 84–88.
This paper introduces a system design, named mixnet, that allows a sender of a message to hide its true identity from a receiver but still allow the receiver to respond.
Section 11.11 on war stories gives a wide range of examples of how adversaries can break a system’s security. This section lists a few papers that provide longer and more detailed descriptions of attacks. This is a fast-moving area; as soon as designers fend off new attacks, adversaries try to find new ones. This arms race is reflected in some of the following readings, and although some of the attacks described have become ineffective (or will over time), these papers provide valuable insights. The proceedings of USENIX Security and of Computer and Communications Security often contain papers explaining current attacks, and conferences run by the so-called “black hat” community document the “progress” on the dark side.
11.6.1 Eugene Spafford. Crisis and aftermath. Communications of the ACM 32, 6 (June 1989), pages 678–687.
This paper documents how the Morris worm works. It was one of the first worms, as well as one of the most sophisticated.
11.6.2 Jonathan Pincus and Brandon Baker. Beyond stack smashing: Recent advances in exploiting buffer overruns. IEEE Security and Privacy 2, 4 (August 2004), pages 20–27.
This paper describes how buffer overrun attacks have evolved since the Morris worm.
11.6.3 Abhishek Kumar, Vern Paxson, and Nicholas Weaver. Exploiting underlying structure for detailed reconstruction of an Internet scale event. Proceedings of the ACM Internet Measurement Conference (October 2005), pages 351–364.
This paper describes the Witty worm and how the authors were able to track down its source. The work contains many interesting nuggets of information.
11.6.4 Vern Paxson. An analysis of using reflectors for distributed denial-of-service attacks. Computer Communication Review 31, 3 (July 2001), pages 38–47.
This paper describes how an adversary can trick a large set of Internet servers to send their combined replies to a victim and in that way launch a denial-of-service attack on the victim. It speculates on several possible directions for defending against such attacks.
11.6.5 Chris Kanich et al. Spamalytics: An empirical analysis of spam marketing conversion. Proceedings of the ACM Conference on Computer and Communications Security (CCS) (October 2008), pages 3–14.
This paper describes the infrastructure that spammers use to send unsolicited e-mail and tries to establish what the financial reward system is for spammers. This paper has its shortcomings, but it is one of the few papers that tries to understand the economics behind spam.
11.6.6 Tom Jagatic, Nathaniel Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Communications of the ACM 50, 10 (October 2007), pages 94–100.
This study investigates the success rate of individual phishing attacks.
3 Jill’s File System for Dummies
4 EZ-Park
5 Goomble
6 Course Swap
7 Banking on Local Remote Procedure Call
8 The Bitdiddler
9 Ben’s Kernel
10 A Picokernel-Based Stock-Ticker System
12 A Bounded Buffer with Semaphores
13 The Single-Chip NC
14 Toastac-25
15 BOOZE: Ben’s Object-Oriented Zoned Environment
20 Gnutella: Peer-to-Peer Networking
31 The Bank of Central Peoria, Limited
33 ANTS: Advanced “Nonce-ensical” Transaction System
35 Alice’s Reliable Block Store
36 Establishing Serializability
45 WebTrust.com (OutOfMoney.com, Part II)
These problem sets seek to make the student think carefully about how to apply the concepts of the text to new problems. The problems are derived from examinations given over the years while teaching the material in this textbook. Many of the problems are multiple choice with several right answers. The reader should try to identify all of the right options.
Some significant and interesting system concepts that are not mentioned in the main text, and therefore at first read seem to be missing from the book, are actually to be found within the exercises and problem sets. Definitions and discussion of these concepts can be found in the text of the exercise or problem set in which they appear. Here is a list of concepts that the exercises and problem sets introduce:
ad hoc wireless network (Problem sets 19 and 21)
bang-bang protocol (Exercise 7.13)
blast protocol (Exercise 7.25)
commutative cryptographic transformation (Exercise 11.4)
condition variable (Problem set 13)
consistent hashing (Problem set 23)
convergent encryption (Problem set 48)
delayed authentication (Exercise 11.9)
delegation forwarding (Exercise 2.1)
event variable (Problem set 11)
follow-me forwarding (Exercise 2.1)
Information Management System atomicity (Exercise 9.5)
lightweight remote procedure call (Problem set 7)
multiple register set processor (Problem set 9)
object-oriented virtual memory (Problem set 15)
overlay network (Problem set 20)
peer-to-peer network (Problem set 20)
RAID 5, with rotating parity (Exercise 8.8)
restartable atomic region (Problem set 9)
self-describing storage (Exercise 6.8)
Exercises for Chapter 7 and above are in the on-line chapters, and problem sets numbered 17 and higher are in the on-line book of problem sets.
Some of these problem sets span the topics of several different chapters. A parenthetical note at the beginning of each set indicates the primary chapters that it involves. Following each exercise or problem-set question is an identifier of the form “1978–3–14”. This identifier reports the year, examination number, and problem number of the examination in which some version of that problem first appeared. For those problem sets not developed by one of the authors, a credit line appears in a footnote on the first page of the problem set.
(Chapter 2)
For his many past sins on previous exams, Ben Bitdiddle is assigned to spend eternity maintaining a PDP-11 running version 7 of the UNIX® operating system. Recently, one of his users’ database applications failed after reaching the file size limit of 1,082,201,088 bytes (approximately 1 gigabyte). In an effort to solve the problem, he upgraded the computer with an old 4-gigabyte (2³²-byte) drive; the disk controller hardware supports 32-bit sector addresses and can address disks up to 2 terabytes in size. Unfortunately, Ben is disappointed to find the file size limit unchanged after installing the new disk.
In this question, the term block number refers to the block pointers stored in inodes. There are 512 bytes in a block. In addition, Ben’s version 7 UNIX system has a file system that has been expanded from the one described in Section 2.5: its inodes are designed to support larger disks. Each inode contains 13 block numbers of 4 bytes each; the first 10 block numbers point to the first 10 blocks of the file, and the remaining 3 are used for the rest of the file. The 11th block number points to an indirect block containing 128 block numbers, the 12th block number points to a double-indirect block containing 128 indirect block numbers, and the 13th block number points to a triple-indirect block containing 128 double-indirect block numbers. Finally, the inode contains a four-byte file size field.
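As a sanity check on the 1,082,201,088-byte figure, the limit follows directly from the block layout just described; here is a small Python sketch (ours, not part of the original problem) that recomputes it:

    # Sketch: compute the maximum file size implied by the inode layout
    # described above (10 direct blocks, one indirect, one double-indirect,
    # one triple-indirect; 512-byte blocks; 128 block numbers per indirect block).
    BLOCK_SIZE = 512
    PTRS_PER_INDIRECT = 512 // 4  # 128 four-byte block numbers per block

    max_blocks = (10                       # direct blocks
                  + PTRS_PER_INDIRECT      # indirect
                  + PTRS_PER_INDIRECT ** 2 # double-indirect
                  + PTRS_PER_INDIRECT ** 3)# triple-indirect

    print(max_blocks * BLOCK_SIZE)  # -> 1082201088, the limit Ben's user hit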
Q 1.1 Which of the following adjustments will allow files larger than the current 1-gigabyte limit to be stored?
A. Increase just the file size field in the inode from a 32-bit to a 64-bit value.
B. Increase just the number of bytes per block from 512 to 2048 bytes.
C. Reformat the disk to increase the number of inodes allocated in the inode table.
D. Replace one of the direct block numbers in each inode with an additional triple-indirect block number.
Ben observes that there are 52 bytes allocated to block numbers in each inode (13 block numbers at 4 bytes each) and 512 bytes allocated to block numbers in each indirect block (128 block numbers at 4 bytes each). He figures that he can keep the total space allocated to block numbers the same but change the size of each block number in order to increase the maximum supported file size. While the number of block numbers in inodes and indirect blocks will change, Ben keeps exactly one indirect, one double-indirect, and one triple-indirect block number in each inode.
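To see how the block-number width drives the limit, the following Python sketch (an illustration we add here, with Ben’s constraints hard-coded) computes the maximum file size for a few candidate widths:

    # Sketch: maximum file size as a function of block-number width w (bytes),
    # under Ben's constraint: 52 bytes of block numbers per inode, 512 bytes
    # per indirect block, and exactly one indirect, one double-indirect, and
    # one triple-indirect block number kept in each inode.
    BLOCK_SIZE = 512

    def max_file_bytes(w):
        per_inode = 52 // w       # block numbers that fit in the inode
        per_indirect = 512 // w   # block numbers per indirect block
        direct = per_inode - 3    # the last 3 are indirect pointers
        blocks = direct + per_indirect + per_indirect**2 + per_indirect**3
        return blocks * BLOCK_SIZE

    for w in (2, 4, 8):
        print(w, max_file_bytes(w))  # w = 4 reproduces 1082201088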
Q 1.2 Which of the following adjustments (without any of the modifications in the previous question) will allow files larger than the current approximately 1-gigabyte limit to be stored?
(Chapter 4)
Ben is in charge of system design for Stickr, a new Web site for posting pictures of bumper stickers and tagging them. Luckily for him, Alyssa had recently implemented a Triplet Storage System (TSS), which stores and retrieves arbitrary triples of the form {subject, relationship, object} according to the following specification:
procedure FIND (subject, relationship, object, start, count)
    // returns OK + array of matching triples
procedure INSERT (subject, relationship, object)
    // adds the triple to the TSS if it is not already there and returns OK
procedure DELETE (subject, relationship, object)
    // removes the triple if it exists, returning TRUE; otherwise returns FALSE
Ben comes up with the following design:
As shown in the figure, Ben uses an RPC interface to allow the Web server to interact with the triplet storage system. Ben chooses at-least-once RPC semantics. Assume that the triplet storage system never crashes but that the network between the Web server and the triplet storage system is unreliable and may drop messages.
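For intuition about what at-least-once semantics can do to a non-idempotent operation, consider this Python sketch; the stub and the 50/50 reply loss are illustrative assumptions, not part of Ben’s design:

    # Sketch (hypothetical stub): under at-least-once semantics the client
    # resends a request until a reply arrives, so the server may execute a
    # non-idempotent operation such as DELETE more than once.
    import random

    store = {("car", "has", "sticker")}

    def server_delete(triple):
        if triple in store:
            store.discard(triple)
            return True
        return False

    def at_least_once_delete(triple):
        while True:
            result = server_delete(triple)  # the request always gets through here
            if random.random() < 0.5:       # but the reply may be lost...
                return result               # reply delivered
            # ...in which case the stub times out and retries the whole request

    print(at_least_once_delete(("car", "has", "sticker")))
    # May print False: a retry after a lost reply finds the triple already gone,
    # even though the first execution really did delete it.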
Q 2.1 Suppose that only a single thread on Ben’s Web server is using the triplet storage system and that this thread issues just one RPC at a time. What types of incorrect behavior can the Web server observe?
A. The FIND RPC stub on the Web server sometimes returns no results, even though matching triples exist in the triplet storage system.
B. The INSERT RPC stub on the Web server sometimes returns OK without inserting the triple into the storage system.
C. The DELETE RPC stub on the Web server sometimes returns FALSE when it actually deleted a triple.
D. The FIND RPC stub on the Web server sometimes returns triples that have been deleted.
Q 2.2 Suppose Ben switches to at-most-once RPC; if no reply is received after some time, the RPC stub on the Web server gives up and returns a “timer expired” error code. Assume again that only a single thread on Ben’s Web server is using the triplet storage system and that this thread issues just one RPC at a time. What types of incorrect behavior can the Web server observe?
A. Assuming it does not time out, the FIND RPC stub on the Web server can sometimes return no results when matching triples exist in the storage system.
B. Assuming it does not time out, the INSERT RPC stub on the Web server can sometimes return OK without inserting the triple into the storage system.
C. Assuming it does not time out, the DELETE RPC stub on the Web server can sometimes return FALSE when it actually deleted a triple.
D. Assuming it does not time out, the FIND RPC stub on the Web server can sometimes return triples that have been deleted.
(Chapter 4)
Mystified by the complexity of NFS, Moon Microsystems guru Jill Boy decides to implement a simple alternative she calls File System for Dummies, or FSD. She implements FSD in two pieces:
1. An FSD server, implemented as a simple user application, which responds to FSD requests. Each request corresponds exactly to a UNIX file system call (e.g., READ, WRITE, OPEN, CLOSE, or CREATE) and returns just the information returned by that call (status, integer file descriptor, data, etc.).
2. An FSD client library, which can be linked together with various applications to substitute Jill’s FSD implementations of file system calls like OPEN, READ, and WRITE for their UNIX counterparts. To avoid confusion, let’s refer to Jill’s FSD versions of these procedures as FSD_OPEN, and so on.
Jill’s client library uses the standard UNIX calls to access local files but uses names of the form
/fsd/hostname/apath
to refer to the file whose absolute path name is /apath on the host named hostname. Her library procedures recognize operations involving remote files, e.g.,
FSD_OPEN (“/fsd/cse.pedantic.edu/foobar”, READ_ONLY)
and translate them into RPC requests to the appropriate host, using the file name on that host, e.g.,
RPC (“cse.pedantic.edu”, “OPEN”, “/foobar”, READ_ONLY).
The RPC call causes the corresponding UNIX call, e.g.,
OPEN (“/foobar”, READ_ONLY)
to be executed on the remote host and the results (e.g., a file descriptor) to be returned as the result of the RPC call. Jill’s server code catches errors in the processing of each request and returns ERROR from the RPC call on remote errors.
Figure 1 describes pseudocode for Version 1 of Jill’s FSD client library. The RPC calls in the code relay simple RPC commands to the server, using exactly-once semantics. Note that no data caching is done by either the server or the client library.
Figure 1 Pseudocode for FSD client library, Version 1
Q 3.1 What does the above code indicate via an empty string (“”) in an entry of the handle-to-host table?
A. An unused entry of the table.
B. An open file on the client host machine.
C. An end-of-file condition on an open file.
Mini Malcode, an intern assigned to Jill, proposes that the above code be simplified by eliminating the handle_to_rhandle_table and simply returning the untranslated handles returned by OPEN on the remote or local machines. Mini implements her simplified client library, making appropriate changes to each FSD call, and tries it on several test programs.
Q 3.2 Which of the following test programs will continue to work after Mini’s simplification?
A. A program that reads a single, local file.
B. A program that reads a single remote file.
C. A program that reads and writes many local files.
D. A program that reads and writes several files from a single remote FSD server.
E. A program that reads many files from different remote FSD servers.
F. A program that reads several local files as well as several files from a single remote FSD server.
Jill rejects Mini’s suggestion, insisting on the Version 1 code shown above. Marketing asks her for a comparison between FSD and NFS (see Section 4.5).
Q 3.3 Complete the following table comparing NFS to FSD by circling yes or no under each of NFS and FSD for each statement:
| Statement | NFS | FSD |
| Remote handles include an inode number | yes / no | yes / no |
| READ and WRITE calls are idempotent | yes / no | yes / no |
| An open file can still be read after it has been deleted (e.g., by a program on the remote host) | yes / no | yes / no |
| The remote file system must be mounted before use | yes / no | yes / no |
Convinced by Moon’s networking experts that a much simpler RPC package promising at-least-once rather than exactly-once semantics will save money, Jill substitutes the simpler RPC framework and tries it out. Although the new (Version 2) FSD works most of the time, Jill finds that an FSD_READ sometimes returns the wrong data; she asks you to help. You trace the problem to multiple executions of a single RPC request by the server and are considering the following changes:
A. A response cache on the client, sufficient to detect identical requests and return a cached result for duplicates without resending the request to the server.
B. A response cache on the server, sufficient to detect identical requests and return a cached result for duplicates without re-executing them.
C. A monotonically increasing sequence number (nonce) added to each RPC request, making otherwise identical requests distinct.
Q 3.4 Which of the following changes would you suggest to address the problem introduced by the at-least-once RPC semantics?
(Chapter 5 in Chapter 4 setting)
Finding a parking spot at Pedantic University is as hard as it gets. Ben Bitdiddle, deciding that a little technology can help, sets about designing the EZ-Park client/server system. He gets a machine to run an EZ-Park server in his dorm room. He manages to convince Pedantic University parking to equip each car with a tiny computer running EZ-Park client software. EZ-Park clients communicate with the server using remote procedure calls (RPCs). A client makes requests to Ben’s server both to find an available spot (when the car’s driver is looking for one) and to relinquish a spot (when the car’s driver is leaving one). A car driver uses a parking spot if, and only if, EZ-Park allocates it to him or her.
In Ben’s initial design, the server software runs in one address space and spawns a new thread for each client request. The server has two procedures: FIND_SPOT () and RELINQUISH_SPOT (). Each of these threads is spawned in response to the corresponding RPC request sent by a client. The server threads use a shared array, available[], of size NSPOTS (the total number of parking spots). available[j] is set to TRUE if spot j is free, and FALSE otherwise; it is initialized to TRUE, and there are no cars parked to begin with. The NSPOTS parking spots are numbered from 0 through NSPOTS − 1. numcars is a global variable that counts the total number of cars parked; it is initialized to 0.
Ben implements the following pseudocode to run on the server. Each FIND_SPOT () thread enters a while loop that terminates only when the car is allocated a spot:
1  procedure FIND_SPOT () // Called when a client car arrives
2      while TRUE do
3          for i ← 0 to NSPOTS do
4              if available[i] = TRUE then
5                  available[i] ← FALSE
6                  numcars ← numcars + 1
7                  return i // Client gets spot i
8  procedure RELINQUISH_SPOT (spot) // Called when a client car leaves
9      available[spot] ← TRUE
10     numcars ← numcars − 1
Ben’s intended correct behavior for his server (the “correctness specification”) is as follows:
A. FIND_SPOT () allocates any given spot in [0, …, NSPOTS − 1] to at most one car at a time, even when cars are concurrently sending requests to the server requesting spots.
B. numcars must correctly maintain the number of parked cars.
C. If at any time (1) spots are available and no parked car ever leaves in the future, (2) there are no outstanding FIND_SPOT () requests, and (3) exactly one client makes a FIND_SPOT request, then the client should get a spot.
Ben runs the server and finds that when there are no concurrent requests, EZ-Park works correctly. However, when he deploys the system, he finds that sometimes multiple cars are assigned the same spot, leading to collisions! His system does not meet the correctness specification when there are concurrent requests.
Make the following assumptions:
1. The statements that update numcars are not atomic; each involves multiple instructions.
2. The server runs on a single processor with a preemptive thread scheduler.
3. The network delivers RPC messages reliably, and there are no network, server, or client failures.
4. Cars arrive and leave at random.
5. ACQUIRE and RELEASE are as defined in Chapter 5.
Q 4.1 Which of these statements is true about the problems with Ben’s design?
A. There is a race condition in accesses to available[], which may violate one of the correctness specifications when two FIND_SPOT () threads run.
B. There is a race condition in accesses to available[], which may violate correctness specification A when one FIND_SPOT () thread and one RELINQUISH_SPOT () thread run.
C. There is a race condition in accesses to numcars, which may violate one of the correctness specifications when more than one thread updates numcars.
D. There is no race condition as long as the average time between client requests to find a spot is larger than the average processing delay for a request.
Ben enlists Alyssa’s help to fix the problem with his server, and she tells him that he needs to set some locks. She suggests adding calls to ACQUIRE and RELEASE as follows:
1  procedure FIND_SPOT () // Called when a client car wants a spot
2      while TRUE do
!→         ACQUIRE (avail_lock)
3          for i ← 0 to NSPOTS do
4              if available[i] = TRUE then
5                  available[i] ← FALSE
6                  numcars ← numcars + 1
!→                 RELEASE (avail_lock)
7                  return i // Allocate spot i to this client
!→         RELEASE (avail_lock)
8  procedure RELINQUISH_SPOT (spot) // Called when a client car is leaving spot
!→     ACQUIRE (avail_lock)
9      available[spot] ← TRUE
10     numcars ← numcars − 1
!→     RELEASE (avail_lock)
Q 4.2 Does Alyssa’s code solve the problem? Why or why not?
Q 4.3 Ben can’t see any good reason for the RELEASE (avail_lock) that Alyssa placed after line 7, so he removes it. Does the program still meet its specifications? Why or why not?
Hoping to reduce contention for avail_lock, Ben rewrites the program as follows:
1  procedure FIND_SPOT () // Called when a client car wants a spot
2      while TRUE do
3          for i ← 0 to NSPOTS do
!→             ACQUIRE (avail_lock)
4              if available[i] = TRUE then
5                  available[i] ← FALSE
6                  numcars ← numcars + 1
!→                 RELEASE (avail_lock)
7                  return i // Allocate spot i to this client
!→             else RELEASE (avail_lock)
8  procedure RELINQUISH_SPOT (spot) // Called when a client car is leaving spot
!→     ACQUIRE (avail_lock)
9      available[spot] ← TRUE
10     numcars ← numcars − 1
!→     RELEASE (avail_lock)
Q 4.4 Does that program meet the specifications?
Now that Ben feels he understands locks better, he tries one more time, hoping that by shortening the code he can really speed things up:
1  procedure FIND_SPOT () // Called when a client car wants a spot
2      while TRUE do
!→         ACQUIRE (avail_lock)
3          for i ← 0 to NSPOTS do
4              if available[i] = TRUE then
5                  available[i] ← FALSE
6                  numcars ← numcars + 1
7                  return i // Allocate spot i to this client
8  procedure RELINQUISH_SPOT (spot) // Called when a client car is leaving spot
9      available[spot] ← TRUE
10     numcars ← numcars − 1
!→     RELEASE (avail_lock)
Q 4.5 Does Ben’s slimmed-down program meet the specifications?
Ben now decides to combat parking at a truly crowded location: Pedantic’s stadium, where there are always cars looking for spots! He updates NSPOTS and deploys the system during the first home game of the football season. Many clients complain that his server is slow or unresponsive.
Q 4.6 If a client invokes the FIND_SPOT () RPC when the parking lot is full, how quickly will it get a response, assuming that multiple cars may be making requests?
A. The client will not get a response until at least one car relinquishes a spot.
B. The client may never get a response even when other cars relinquish their spots.
Alyssa tells Ben to add a client-side timer to his RPC system that expires if the server does not respond within 4 seconds. Upon a timer expiration, the car’s driver may retry the request or instead choose to leave the stadium and watch the game on TV. Alyssa warns Ben that this change may cause the system to violate the correctness specification.
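A sketch of the retry behavior Alyssa is worried about, with illustrative names and a shortened timeout; each retry hands the server another thread working on behalf of the same car:

    # Sketch: a client timer plus retries can leave the server running
    # several FIND_SPOT threads for one car at the same time.
    import threading, time

    server_threads = []

    def find_spot_rpc(car):
        t = threading.Thread(target=lambda: time.sleep(2))  # slow server work
        t.start()
        server_threads.append((car, t))

    def client_with_timer(car, retries):
        for attempt in range(retries):
            find_spot_rpc(car)   # request sent
            time.sleep(0.1)      # stand-in for the 4-second timeout
            # no reply yet -> timer expires -> the driver retries

    client_with_timer("car1", 3)
    print(len(server_threads), "server threads for one car")  # prints 3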
Q 4.7 When Ben adds the timer to his client, he finds some surprises. Which of the following statements is true of Ben’s implementation?
A. The server may be running multiple active threads on behalf of the same client car at any given time.
B. The server may assign the same spot to two cars making requests.
C. numcars may be smaller than the actual number of cars parked in the parking lot.
D. numcars may be larger than the actual number of cars parked in the parking lot.
Q 4.8 Alyssa thinks that the operating system running Ben’s server may be spending a fair amount of time switching between threads when many RPC requests are being processed concurrently. Which of these statements about the work required to perform the switch is correct? Notation: PC = program counter; SP = stack pointer; PMAR = page-map address register. Assume that the operating system behaves according to the description in Chapter 5.
A. On any thread switch, the operating system saves the values of the PMAR, PC, SP, and several registers.
B. On any thread switch, the operating system saves the values of the PC, SP, and several registers.
C. On any thread switch between two RELINQUISH_SPOT () threads, the operating system saves only the value of the PC, since RELINQUISH_SPOT () has no return value.
D. The number of instructions required to switch from one thread to another is proportional to the number of bytes currently on the thread’s stack.
(Chapter 5)
Observing that U.S. legal restrictions have curtailed the booming on-line gambling industry, a group of laid-off programmers has launched a new venture called Goomble. Goomble’s Web server allows customers to establish an account, deposit funds using a credit card, and then play the Goomble game by clicking a button labeled I FEEL LUCKY. Every such button click debits their account by $1, until the balance reaches zero.
Goomble’s lawyers have successfully defended the game against legal challenges by arguing that no gambling is involved: the Goomble “service” is entirely deterministic.
The initial implementation of the Goomble server uses a single thread, which causes all customer requests to be executed in some serial order. Each click on the I FEEL LUCKY button results in a procedure call to LUCKY (account), where account refers to a data structure representing the user’s Goomble account. Among other data, the account structure includes an unsigned 32-bit integer balance, representing the customer’s current balance in dollars.
The LUCKY procedure is coded as follows:
1  procedure LUCKY (account)
2      if account.balance > 0 then
3          account.balance ← account.balance − 1
The Goomble software quality-control expert, Nellie Nervous, inspects the single-threaded Goomble server code to check for race conditions.
Q 5.1 Should Nellie find any potential race conditions? Why or why not? 2007–1–8
The success of the Goomble site quickly swamps their single-threaded server, limiting Goomble’s profits. Goomble hires a server performance expert, Threads Galore, to improve server throughput.
Threads modifies the server as follows: each I FEEL LUCKY click request spawns a new thread, which calls LUCKY (account) and then exits. All other requests (e.g., setting up an account, depositing funds, etc.) are served by a single thread. Threads argues that the bulk of the server traffic consists of players clicking I FEEL LUCKY, so his solution addresses the main performance problem. Unfortunately, Nellie doesn’t have time to inspect the multithreaded version of the server. She is busy with development of a follow-on product: the Goomba, which simultaneously cleans out your bank account and washes your kitchen floor.
Q 5.2 Suppose Nellie had inspected Goomble’s multithreaded server. Should she have found any potential race conditions? Why or why not? 2007–1–9
Willie Windfall, a compulsive Goomble player, has two computers and plays Goomble simultaneously on both (using the same Goomble account). He has mortgaged his house, depleted his retirement fund and the money saved for his kids’ education, and his Goomble account is nearly at zero. One morning, clicking furiously on the I FEEL LUCKY buttons on both screens, he notices that his Goomble balance has jumped to something over four billion dollars.
Q 5.3 Explain a possible source of Willie’s good fortune. Give a simple scenario involving two threads, T1 and T2, with interleaved execution of lines 2 and 3 in calls to LUCKY (account), detailing the timing that could result in a huge account.balance. The first step of the scenario is already filled in; fill in as many subsequent steps as needed.
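One way to model such an interleaving is the following Python sketch (ours); Python integers do not wrap, so the 32-bit masking stands in for the unsigned hardware arithmetic:

    # Sketch: why the balance can jump to ~4 billion. With an unsigned 32-bit
    # balance, two threads that both pass the "balance > 0" test when
    # balance = 1 each subtract 1, and the second subtraction wraps around.
    MASK = 0xFFFFFFFF

    balance = 1
    # T1 tests: balance > 0 -> true
    # T2 tests: balance > 0 -> true (T1 has not stored yet)
    balance = (balance - 1) & MASK  # T1 stores 0
    balance = (balance - 1) & MASK  # T2 stores 0 - 1, which wraps

    print(balance)  # 4294967295, a bit over four billion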
Word of Willie’s big win spreads rapidly, and Goomble billionaires proliferate. In a state of panic, the Goomble board calls you in as a consultant to review three possible fixes to the server code to prevent further “gifts” to Goomble customers. Each of the following proposals involves adding a lock (either global or specific to an account) to rule out the unfortunate race:
Proposal 1:
procedure LUCKY (account)
    ACQUIRE (global_lock)
    if account.balance > 0 then
        account.balance ← account.balance − 1
    RELEASE (global_lock)
Proposal 2:
procedure LUCKY (account)
    ACQUIRE (account.lock)
    temp ← account.balance
    RELEASE (account.lock)
    if temp > 0 then
        ACQUIRE (account.lock)
        account.balance ← account.balance − 1
        RELEASE (account.lock)
Proposal 3:
procedure LUCKY (account)
    ACQUIRE (account.lock)
    if account.balance > 0 then
        account.balance ← account.balance − 1
    RELEASE (account.lock)
Q 5.4 Which of the three proposals have race conditions? 2007–1–11
Q 5.5 Which proposal would you recommend deploying, considering both correctness and performance goals? 2007–1–12
(Chapter 5 in Chapter 4 setting)
The Subliminal Sciences Department, in order to reduce the department head’s workload, has installed a Web server to help assign lecturers to classes for the Fall teaching term. There happen to be exactly as many courses as lecturers, and department policy is that every lecturer teach exactly one course and every course have exactly one lecturer. For each lecturer in the department, the server stores the name of the course currently assigned to that lecturer. The server’s Web interface supports one request: to swap the courses assigned to a pair of lecturers.
Version One of the server’s code looks like this:
// CODE VERSION ONE
assignments[] // an associative array of course names indexed by lecturer
procedure SERVER ()
    do forever
        m ← wait for a request message
        value ← m.FUNCTION (m.arguments, …) // execute function in request message
        send value to m.sender
procedure EXCHANGE (lecturer1, lecturer2)
    temp ← assignments[lecturer1]
    assignments[lecturer1] ← assignments[lecturer2]
    assignments[lecturer2] ← temp
    return “OK”
Because there is only one application thread on the server, the server can handle only one request at a time. Requests comprise a function and its arguments (in this case, EXCHANGE (lecturer1, lecturer2)), which the SERVER () procedure executes via the m.FUNCTION (m.arguments, …) call.
For all of the following questions, assume that there are no lost messages and no crashes. The operating system buffers incoming messages. When the server program asks for a message of a particular type (e.g., a request), the operating system gives it the oldest buffered message of that type.
Assume that network transmission times never exceed a fraction of a second and that computation also takes only a fraction of a second. There are no concurrent activities other than those explicitly mentioned or implied by the pseudocode, and no other activity on the server computers.
Suppose the server starts out with the following assignments:
assignments[“Herodotus”] = “Steganography”
assignments[“Augustine”] = “Numerology”
Q 6.1 Lecturers Herodotus and Augustine decide they wish to swap lectures, so that Herodotus teaches Numerology and Augustine teaches Steganography. They each send an EXCHANGE (“Herodotus”, “Augustine”) request to the server at the same time. If you look a moment later at the server, which, if any, of the following states are possible?
A.  assignments[“Herodotus”] = “Numerology”
    assignments[“Augustine”] = “Steganography”
B.  assignments[“Herodotus”] = “Steganography”
    assignments[“Augustine”] = “Numerology”
C.  assignments[“Herodotus”] = “Steganography”
    assignments[“Augustine”] = “Steganography”
D.  assignments[“Herodotus”] = “Numerology”
    assignments[“Augustine”] = “Numerology”
The Department of Dialectic decides it wants its own lecturer-assignment server. Initially, it installs a server completely independent from that of the Subliminal Sciences Department, with the same rules (an equal number of lecturers and courses, with a one-to-one matching). Later, the two departments decide that they wish to allow their lecturers to teach courses in either department, so they extend the server software in the following way. Lecturers can send either server a CROSSEXCHANGE request, asking to swap courses between a lecturer in that server’s department and a lecturer in the other server’s department. In order to implement CROSSEXCHANGE, the servers can send each other SET-AND-GET requests, which set a lecturer’s course and return the lecturer’s previous course. Here is Version Two of the server code, for both departments:
// CODE VERSION TWO
procedure SERVER () // same as in Version One
procedure EXCHANGE () // same as in Version One
procedure CROSSEXCHANGE (local-lecturer, remote-lecturer)
    temp1 ← assignments[local-lecturer]
    send {SET-AND-GET, remote-lecturer, temp1} to the other server
    temp2 ← wait for response to SET-AND-GET
    assignments[local-lecturer] ← temp2
    return “OK”
procedure SET-AND-GET (lecturer, course)
    old ← assignments[lecturer]
    assignments[lecturer] ← course
    return old
Suppose the starting state on the Subliminal Sciences server is:
assignments[“Herodotus”] = “Steganography”
assignments[“Augustine”] = “Numerology”
And on the Department of Dialectic server:
assignments[“Socrates”] = “Epistemology”
assignments[“Descartes”] = “Reductionism”
Q 6.2 At the same time, lecturer Herodotus sends a CROSSEXCHANGE (“Herodotus”, “Socrates”) request to the Subliminal Sciences server, and lecturer Descartes sends a CROSSEXCHANGE (“Descartes”, “Augustine”) request to the Department of Dialectic server. If you look a minute later at the Subliminal Sciences server, which, if any, of the following states are possible?
A.  assignments[“Herodotus”] = “Steganography”
    assignments[“Augustine”] = “Numerology”
B.  assignments[“Herodotus”] = “Epistemology”
    assignments[“Augustine”] = “Reductionism”
C.  assignments[“Herodotus”] = “Epistemology”
    assignments[“Augustine”] = “Numerology”
In a quest to increase performance, the two departments make their servers multithreaded: each server serves each request in a separate thread. Thus, if multiple requests arrive at roughly the same time, the server may process them in parallel. Each server has multiple processors. Here is the threaded server code, Version Three:
// CODE VERSION THREE
procedure EXCHANGE () // same as in Version Two
procedure CROSSEXCHANGE () // same as in Version Two
procedure SET-AND-GET () // same as in Version Two
procedure SERVER ()
    do forever
        m ← wait for a request message
        ALLOCATE_THREAD (DOIT, m) // create a new thread that runs DOIT (m)
procedure DOIT (m)
    value ← m.FUNCTION (m.arguments, …)
    send value to m.sender
    EXIT () // terminate this thread
Q 6.3 With the same starting state as in the previous question, but with the new version of the code, lecturer Herodotus sends a CROSSEXCHANGE (“Herodotus”, “Socrates”) request to the Subliminal Sciences server, and lecturer Descartes sends a CROSSEXCHANGE (“Descartes”, “Augustine”) request to the Department of Dialectic server, at the same time. If you look a minute later at the Subliminal Sciences server, which, if any, of the following states are possible?
A.  assignments[“Herodotus”] = “Steganography”
    assignments[“Augustine”] = “Numerology”
B.  assignments[“Herodotus”] = “Epistemology”
    assignments[“Augustine”] = “Reductionism”
C.  assignments[“Herodotus”] = “Epistemology”
    assignments[“Augustine”] = “Numerology”
An alert student notes that Version Three may be subject to race conditions. He changes the code to have one lock per lecturer, stored in an array called locks[]. He changes EXCHANGE, CROSSEXCHANGE, and SET-AND-GET to ACQUIRE the locks of the lecturer(s) they affect. Here is the result, Version Four:
// CODE VERSION FOUR
procedure SERVER () // same as in Version Three
procedure DOIT () // same as in Version Three
procedure EXCHANGE (lecturer1, lecturer2)
    ACQUIRE (locks[lecturer1])
    ACQUIRE (locks[lecturer2])
    temp ← assignments[lecturer1]
    assignments[lecturer1] ← assignments[lecturer2]
    assignments[lecturer2] ← temp
    RELEASE (locks[lecturer1])
    RELEASE (locks[lecturer2])
    return “OK”
procedure CROSSEXCHANGE (local-lecturer, remote-lecturer)
    ACQUIRE (locks[local-lecturer])
    temp1 ← assignments[local-lecturer]
    send {SET-AND-GET, remote-lecturer, temp1} to the other server
    temp2 ← wait for response to SET-AND-GET
    assignments[local-lecturer] ← temp2
    RELEASE (locks[local-lecturer])
    return “OK”
procedure SET-AND-GET (lecturer, course)
    ACQUIRE (locks[lecturer])
    old ← assignments[lecturer]
    assignments[lecturer] ← course
    RELEASE (locks[lecturer])
    return old
Q 6.4 This code is subject to deadlock. Why?
Q 6.5 For each of the following situations, indicate whether deadlock can occur. In each situation, there is no activity other than that mentioned.
A. Client A sends EXCHANGE (“Herodotus”, “Augustine”) at the same time that client B sends EXCHANGE (“Herodotus”, “Augustine”), both to the Subliminal Sciences server.
B. Client A sends EXCHANGE (“Herodotus”, “Augustine”) at the same time that client B sends EXCHANGE (“Augustine”, “Herodotus”), both to the Subliminal Sciences server.
C. Client A sends CROSSEXCHANGE (“Augustine”, “Socrates”) to the Subliminal Sciences server at the same time that client B sends CROSSEXCHANGE (“Descartes”, “Herodotus”) to the Department of Dialectic server.
D. Client A sends CROSSEXCHANGE (“Augustine”, “Socrates”) to the Subliminal Sciences server at the same time that client B sends CROSSEXCHANGE (“Socrates”, “Augustine”) to the Department of Dialectic server.
E. Client A sends CROSSEXCHANGE (“Augustine”, “Socrates”) to the Subliminal Sciences server at the same time that client B sends CROSSEXCHANGE (“Descartes”, “Augustine”) to the Department of Dialectic server.
(Chapter 5)
The bank president has asked Ben Bitdiddle to add enforced modularity to a large banking application. Ben splits the program into two pieces: a client and a service. He wants to use remote procedure calls to communicate between the client and the service, which both run on the same physical machine with one processor. Ben explores an implementation that the literature calls lightweight remote procedure call (LRPC). Ben’s version of LRPC uses user-level gates. User gates can be bootstrapped using two kernel gates: one gate that registers the name of a user gate, and a second gate that performs the actual transfer:
REGISTER_GATE (stack, address). Registers address as an entry point, to be executed on the stack stack. The kernel stores these addresses in an internal table.
TRANSFER_TO_GATE (address). Transfers control to address. A client uses this call to transfer control to a service. The kernel must first check whether address has been registered as a gate. If so, the kernel transfers control; otherwise, it returns an error to the caller.
We assume that the client and the service each run in their own virtual address space. On initialization, the service registers an entry point with REGISTER_GATE and allocates a block at address transfer. Both the client and the service map the transfer block into their address spaces with READ and WRITE permissions. The client and the service use this shared transfer page to communicate the arguments to and results of a remote procedure call. The client and the service each start with one thread. There are no user programs other than the client and the service running on the machine.
The following pseudocode summarizes the initialization:
Service                                  Client
procedure INIT_SERVICE ()                procedure INIT_CLIENT ()
    REGISTER_GATE (stack, receive)           MAP (my_id, transfer, shared_client)
    ALLOCATE_BLOCK (transfer)
    MAP (my_id, transfer, shared_server)
    while TRUE do YIELD ()
When a client performs an LRPC, it copies the arguments of the LRPC into the transfer page. Then it calls TRANSFER_TO_GATE to transfer control to the service address space at the registered address receive. The client thread, which is now in the service’s address space, performs the requested operation (the code for the procedure at the address receive is not shown because it is not important for the questions). On returning from the requested operation, the procedure at the address receive writes the result parameters into the transfer block and transfers control back to the client’s address space, to the procedure RETURN_LRPC. Once back in the client address space in RETURN_LRPC, the client copies the results back to the caller. The following pseudocode summarizes the implementation of LRPC:
1  procedure LRPC (id, request)
2      copy (request, shared_client)
3      TRANSFER_TO_GATE (receive)
4      return
5
6  procedure RETURN_LRPC ()
7      copy (shared_client, reply)
8      return (reply)
Now that we know how to use the procedures REGISTER_GATE and TRANSFER_TO_GATE, let’s turn our attention to the implementation of TRANSFER_TO_GATE (entrypoint is the internal kernel table recording gate information):
1  procedure TRANSFER_TO_GATE (address)
2      if an id exists such that entrypoint[id].entry = address then
3          R1 ← USER_TO_KERNEL (entrypoint[id].stack)
4          R2 ← address
5          STORE R2, R1               // put address on service’s stack
6          SP ← entrypoint[id].stack  // set SP to service stack
7          SUB 4, SP                  // adjust stack
8          PMAR ← entrypoint[id].pmar // set page-map address
9          USER ← ON                  // switch to user mode
10         return                     // returns to address
11     else
12         return (ERROR)
The procedure checks whether or not the service has registered address as an entry point (line 2). Lines 4–7 push the entry address onto the service’s stack and set the register SP to point to the service’s stack. To be able to do so, the kernel must translate the address of the stack in the service address space into an address in the kernel address space so that the kernel can write the stack (line 3). Finally, the procedure stores the service’s page-map address into PMAR (line 8), sets the user-mode bit to ON (line 9), and invokes the gate’s procedure by returning from TRANSFER_TO_GATE (line 10), which loads address from the service’s stack into the PC.
The implementation of this procedure is tricky because it switches address spaces, so it must be careful to refer to the appropriate variable in the appropriate address space. For example, after line 8, TRANSFER_TO_GATE runs the next instruction (line 9) in the service’s address space. This works only if the kernel is mapped at the same address in both the client’s and the service’s address spaces.
Q 7.1 The procedure INIT_SERVICE calls YIELD. In which address space or address spaces is the code that implements the supervisor call YIELD located?
Q 7.2 For LRPC to work correctly, must the two virtual addresses transfer have the same value in the client and service address spaces?
Q 7.3 During the execution of the procedure located at address receive, how many threads are running or are in a call to YIELD in the service address space?
Q 7.4 How many supervisor calls could the client perform in the procedure LRPC?
Q 7.5 Ben’s goal is to enforce modularity. Which of the following statements are true statements about Ben’s LRPC implementation?
A. The client thread cannot transfer control to any address in the server address space.
B. The client thread cannot overwrite any physical memory that is mapped in the server’s address space.
C. After the client has invoked TRANSFER_TO_GATE in LRPC, the server is guaranteed to invoke RETURN_LRPC.
D. The procedure LRPC ought to be modified to check the response message and process only valid responses.
Q 7.6 Assume that REGISTER_GATE and TRANSFER_TO_GATE are also used by other programs. Which of the following statements is true about the implementations of REGISTER_GATE and TRANSFER_TO_GATE?
A. The kernel might use an invalid address when writing the value address on the stack passed in by a user program.
B. A user program might use an invalid address when entering the service address space.
C. The kernel transfers control to the server address space with the user-mode bit switched OFF.
D. The kernel enters the server address space only at the registered address entry address.
Ben modifies the client to have multiple threads of execution. If one client thread calls the server and the procedure at address receive calls YIELD, another client thread can run on the processor.
Q 7.7 Which of the following statements is true about the implementation of LRPC with multiple threads?
A. On a single-processor machine, there can be race conditions when multiple client threads call LRPC, even if the kernel schedules the threads non-preemptively.
B. On a single-processor machine, there can be race conditions when multiple client threads call LRPC and the kernel schedules the threads preemptively.
C. On a multiprocessor computer, there can be race conditions when multiple client threads call LRPC.
D. It is impossible to have multiple threads if the computer doesn’t have multiple physical processors.
(Chapter 5)
Ben Bitdiddle is designing a file system for a new handheld computer, the Bitdiddler, which is designed to be especially simple for, as he likes to say, “people who are just average, like me.”
In keeping with his theme of simplicity and ease of use for average people, Ben decides to design a file system without directories. The disk is physically partitioned into three regions: an inode list, a free list, and a collection of 4K data blocks, much like the UNIX file system. Unlike in the UNIX file system, each inode contains the name of the file it corresponds to, as well as a bit indicating whether or not the inode is in use. As in the UNIX file system, the inode also contains a list of the blocks that compose the file, as well as metadata about the file, including permission bits, its length in bytes, and modification and creation timestamps. The free list is a bitmap, with one bit per data block indicating whether that block is free or in use. There are no indirect blocks in Ben’s file system. The following figure illustrates the basic layout of the Bitdiddler file system:
The file system provides six primary calls: CREATE, OPEN, READ, WRITE, CLOSE, and UNLINK. Ben implements all six correctly and in a straightforward way, as shown below. All updates to the disk are synchronous; that is, when a call to write a block of data to the disk returns, that block is definitely installed on the disk. Individual block writes are atomic.
procedure CREATE (filename)
    scan all non-free inodes to ensure filename is not a duplicate (return ERROR if it is)
    find a free inode in the inode list
    update the inode with 0 data blocks, mark it as in use, and write it to disk
    update the free list to indicate that the inode is in use, and write the free list to disk
procedure OPEN (filename) // returns a file handle
    scan non-free inodes looking for filename
    if found, allocate and return a file handle fh that refers to that inode
procedure WRITE (fh, buf, len)
    look in file handle fh to determine the inode of the file, and read the inode from disk
    if there is free space in the last block of the file, write to it
    determine the number of new blocks needed, n
    for i ← 1 to n
        use the free list to find a free block b
        update the free list to show that b is in use, and write the free list to disk
        add b to the inode, and write the inode to disk
        write the appropriate data for block b to disk
procedure READ (fh, buf, len)
    look in file handle fh to determine the inode of the file, and read the inode from disk
    read len bytes of data from the current location in the file into buf
procedure CLOSE (fh)
    remove fh from the file handle table
procedure UNLINK (filename)
    scan non-free inodes looking for filename, and mark that inode as free
    write the inode to disk
    mark the data blocks used by the file as free in the free list
    write the modified free list blocks to disk
Ben 为 Bitdiddler 编写了以下简单的应用程序:
Ben writes the following simple application for the Bitdiddler:
创建(文件名)
CREATE (filename)
fh ← OPEN(文件名)
fh ← OPEN (filename)
WRITE ( fh , app_data , LENGTH ( app_data ))//app_data 是需要写入的数据
WRITE (fh, app_data, LENGTH (app_data))//app_data is some data to be written
关闭( fh )
CLOSE (fh)
Q 8.1 Ben notices that if he pulls the batteries out of the Bitdiddler while running his application and then replaces the batteries and reboots the machine, the file his application created exists but contains unexpected data that he didn’t write into the file. Which of the following are possible explanations for this behavior? (Assume that the disk controller never writes partial blocks.)
A. The free list entry for a data page allocated by the call to WRITE was written to disk, but neither the inode nor the data page itself was written.
B. The inode allocated to Ben’s application previously contained a (since deleted) file with the same name. If the system crashed during the call to CREATE, it may cause the old file to reappear with its previous contents.
C. The free list entry for a data page allocated by the call to WRITE as well as a new copy of the inode were written to disk, but the data page itself was not.
D. The free list entry for a data page allocated by the call to WRITE as well as the data page itself were written to disk, but the new inode was not.

Q 8.2 Ben decides to fix inconsistencies in the Bitdiddler’s file system by scanning its data structures on disk every time the Bitdiddler starts up. Which of the following inconsistencies can be identified using this approach (without modifying the Bitdiddler implementation)?
(Chapter 5)
Ben develops an operating system for a simple computer. The operating system has a kernel that provides virtual address spaces, threads, and output to a console.

Each application has its own user-level address space and uses one thread. The kernel program runs in the kernel address space but doesn’t have its own thread. (The kernel program is described in more detail below.)

The computer has one processor, a memory, a timer chip (which will be introduced later), a console device, and a bus connecting the devices. The processor has a user-mode bit and is a multiple register set design, which means that it has two sets of program counter (PC), stack pointer (SP), and page-map address registers (PMAR). One set is for user space (the user-mode bit is set to ON): upc, usp, and upmar. The other set is for kernel space (the user-mode bit is set to OFF): kpc, ksp, and kpmar. Only programs in kernel mode are allowed to store to upmar, kpc, ksp, and kpmar—storing a value in these registers is an illegal instruction in user mode.

The processor switches from user to kernel mode when one of three events occurs: an application issues an illegal instruction, an application issues a supervisor call instruction (with the SVC instruction), or the processor receives an interrupt in user mode. The processor switches from user to kernel mode by setting the user-mode bit OFF. When that happens, the processor continues operation but using the current values in the kpc, ksp, and kpmar. The user program counter, stack pointer, and page-map address values remain in upc, usp, and upmar, respectively.

To return from kernel to user space, a kernel program executes the RTI instruction, which sets the user-mode bit to ON, causing the processor to use upc, usp, and upmar. The kpc, ksp, and kpmar values remain unchanged, awaiting the next SVC. In addition to these registers, the processor has four general-purpose registers: ur0, ur1, kr0, and kr1. The ur0 and ur1 pair are active in user mode. The kr0 and kr1 pair are active in kernel mode.
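To make the multiple-register-set design concrete, here is a minimal Python sketch of the processor state and the two mode switches; the names and structure are illustrative assumptions, not part of Ben’s system.

    from dataclasses import dataclass

    @dataclass
    class SimpleProcessor:
        # user-space register set, used while the user-mode bit is ON
        upc: int = 0
        usp: int = 0
        upmar: int = 0
        ur0: int = 0
        ur1: int = 0
        # kernel-space register set, used while the user-mode bit is OFF
        kpc: int = 0
        ksp: int = 0
        kpmar: int = 0
        kr0: int = 0
        kr1: int = 0
        user_mode: bool = False       # the machine starts in kernel mode

        def enter_kernel(self):
            # an SVC, an illegal instruction, or an interrupt in user mode:
            # the bit goes OFF, execution continues with kpc/ksp/kpmar, and the
            # interrupted user state simply stays behind in upc/usp/upmar
            self.user_mode = False

        def rti(self):
            # RTI: the bit goes ON and execution resumes with upc/usp/upmar;
            # kpc/ksp/kpmar keep their values, awaiting the next SVC
            self.user_mode = True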
Ben runs two user applications. Each executes the following set of programs:
integer t initially 1 // initial value for shared variable t

procedure MAIN ()
    do forever
        t ← t + t
        PRINT (t)
        YIELD ()

procedure YIELD
    SVC 0
PRINT prints the value of t on the output console. The output console is an output-only device and generates no interrupts.
The kernel runs each program in its own user-level address space. Each user address space has one thread (with its own stack), which is managed by the kernel:
integer currentthread // index for the current user thread
structure thread[2] // storage place for thread state when not running
    integer sp // user stack pointer
    integer pc // user program counter
    integer pmar // user page-map address register
    integer r0 // user register 0
    integer r1 // user register 1

procedure DOYIELD ()
    thread[currentthread].sp ← usp // save registers
    thread[currentthread].pc ← upc
    thread[currentthread].pmar ← upmar
    thread[currentthread].r0 ← ur0
    thread[currentthread].r1 ← ur1
    currentthread ← (currentthread + 1) modulo 2 // select new thread
    usp ← thread[currentthread].sp // restore registers
    upc ← thread[currentthread].pc
    upmar ← thread[currentthread].pmar
    ur0 ← thread[currentthread].r0
    ur1 ← thread[currentthread].r1
For simplicity, this non-preemptive thread manager is tailored for just the two user threads that are running on Ben’s kernel. The system starts by executing the procedure KERNEL. Here is its code:
procedure KERNEL ()
    CREATE_THREAD (MAIN) // set up Ben’s two threads
    CREATE_THREAD (MAIN)
    usp ← thread[1].sp // initialize user registers for thread 1
    upc ← thread[1].pc
    upmar ← thread[1].pmar
    ur0 ← thread[1].r0
    ur1 ← thread[1].r1
    do forever
        RTI // run a user thread until it issues an SVC
        n ← ??? // see question Q 9.1
        if n = 0 then DOYIELD ()
Since the kernel passes control to the user with the RTI instruction, when the user executes an SVC, the processor continues execution in the kernel at the instruction following the RTI.

Ben’s operating system sets up three page maps, one for each user program, and one for the kernel program. Ben has carefully set up the page maps so that the three address spaces don’t share any physical memory.

Q 9.1 Describe how the supervisor obtains the value of n, which is the identifier for the SVC that the calling program has invoked.
Q 9.2 How can the current address space be switched?
A. By the kernel writing the kpmar register.
B. By the kernel writing the upmar register.
C. By the processor changing the user-mode bit.
Q 9.3 Ben runs the system for a while, watching it print several results, and then halts the processor to examine its state. He finds that it is in the kernel, where it is just about to execute the RTI instruction. In which procedure(s) could the user-level thread resume when the kernel executes that RTI instruction?

Q 9.4 In Ben’s design, what mechanisms play a role in enforcing modularity?
A. Separate address spaces because wild writes from one application cannot modify the data of the other application.
B. User-mode bit because it disallows user programs to write to upmar and kpmar.
C. The kernel because it forces threads to give up the processor.
Ben reads about the timer chip in his hardware manual and decides to modify the kernel to take advantage of it. At initialization time, the kernel starts the timer chip, which will generate an interrupt every 100 milliseconds. (Ben’s computer has no other sources of interrupts.) Note that the interrupt-enable bit is OFF when executing in the kernel address space; the processor checks for interrupts only before executing a user-mode instruction. Thus, whenever the timer chip generates an interrupt while the processor is in kernel mode, the interrupt will be delayed until the processor returns to user mode. An interrupt in user mode causes an SVC -1 instruction to be inserted in the instruction stream. Finally, Ben modifies the kernel by replacing the do forever loop and adding an interrupt handler, as follows:
do forever
    RTI // run a user thread until it issues an SVC
    n ← ??? // assume answer to question Q 9.1
    if n = 1 then DOINTERRUPT ()
    if n = 0 then DOYIELD ()

procedure DOINTERRUPT ()
    DOYIELD ()
Do not make any assumption about the speed of the processor.
Q 9.5 Ben again runs the system for a while, watching it print several results, and then he halts the processor to examine its state. Once again, he finds that it is in the kernel, where it is just about to execute the RTI instruction. In which procedure(s) could the user-level thread resume after the kernel executes the RTI instruction?

Q 9.6 In Ben’s second design, what mechanisms play a role in enforcing modularity?
A. Separate address spaces because wild writes from one application cannot modify the data of the other application.
B. User-mode bit because it disallows user programs to write to upmar and kpmar.
C. The timer chip because it, in conjunction with the kernel, forces threads to give up the processor.
Ben modifies the two user programs to share the variable t, by mapping t in the virtual address space of both user programs at the same place in physical memory. Now both threads read and write the same t. Note that registers are not shared between threads: the scheduler saves and restores the registers on a thread switch. Ben’s simple compiler translates the critical region of code:
t ← t + t
into the processor instructions:
100 LOAD t, r0 // read t into register 0
104 LOAD t, r1 // read t into register 1
108 ADD r1, r0 // add registers 0 and 1, leave result in register 0
112 STORE r0, t // store register 0 into t

The numbers in the leftmost column in this code are the virtual addresses where the instructions are stored in both virtual address spaces. Ben’s processor executes the individual instructions atomically.
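To see what this per-instruction (but not per-region) atomicity permits, the following Python enumeration (an illustration, not part of Ben’s system) simulates every interleaving of a single pass through the four instructions by each of the two threads, starting from t = 1:

    from itertools import combinations

    def run(schedule, t=1):
        regs = [{"r0": 0, "r1": 0}, {"r0": 0, "r1": 0}]   # registers are per thread
        pcs = [0, 0]
        for thread in schedule:
            r, step = regs[thread], pcs[thread]
            if step == 0:
                r["r0"] = t                     # 100 LOAD t, r0
            elif step == 1:
                r["r1"] = t                     # 104 LOAD t, r1
            elif step == 2:
                r["r0"] = r["r0"] + r["r1"]     # 108 ADD r1, r0
            elif step == 3:
                t = r["r0"]                     # 112 STORE r0, t
            pcs[thread] += 1
        return t

    # an interleaving is a choice of which 4 of the 8 instruction slots thread 0 gets
    finals = {run([0 if i in slots else 1 for i in range(8)])
              for slots in combinations(range(8), 4)}
    print(sorted(finals))   # [2, 3, 4]; only 4 arises when the two passes do not overlap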
Q 9.7 What values can the applications print (don’t worry about overflows)?
In a conference proceedings, Ben reads about an idea called restartable atomic regions* and implements them. If a thread is interrupted in a critical region, the thread manager restarts the thread at the beginning of the critical region when it resumes the thread. Ben recodes the interrupt handler as follows:

procedure DOINTERRUPT ()
    if upc ≥ 100 and upc ≤ 112 then // were we in the critical region?
        upc ← 100 // yes, restart critical region when resumed!
    DOYIELD ()
The processor increments the program counter after interpreting an instruction and before processing interrupts.
Q 9.8 Now, what values can the applications print (don’t worry about overflows)?
Q 9.9 Can a second thread enter the region from virtual addresses 100 through 112 while the first thread is in it (i.e., the first thread’s upc contains a value in the range 100 through 112)?
A. Yes, because while the first thread is in the region, an interrupt may cause the processor to switch to the second thread and the second thread might enter the region.
B. Yes, because the processor doesn’t execute the first three lines of code in DOINTERRUPT atomically.
C. Yes, because the processor doesn’t execute DOYIELD atomically.
Ben is exploring if he can put just any code in a restartable atomic region. He creates a restartable atomic region that contains three instructions, which swap the content of two variables a and b using a temporary x:

100 x ← a
104 a ← b
108 b ← x
Ben also modifies DOINTERRUPT, replacing 112 with 108:

procedure DOINTERRUPT ()
    if upc ≥ 100 and upc ≤ 108 then // were we in the critical region?
        upc ← 100 // yes, restart critical region when resumed!
    DOYIELD ()
Variables a and b start out with the values a = 1 and b = 2, and the timer chip is running.

Q 9.10 What are some possible outcomes if a thread executes this restartable atomic region and variables a, b, and x are not shared?
(Chapter 5)
Ben Bitdiddle decides to design a computer system based on a new kernel architecture he calls picokernels and on a new hardware platform called simplePC. Ben has paid attention to Section 1.1 and is going for extreme simplicity. The simplePC platform contains one simple processor, a page-based virtual memory manager (which translates the virtual addresses issued by the processor), a memory module, and an input and output device. The processor has two special registers, a program counter (PC) and a stack pointer (SP). The SP points to the value on the top of the stack.

The calling convention for the simplePC processor uses a simple stack model:

A call to a procedure pushes the address of the instruction after the call onto the stack and then jumps to the procedure.
Return from a procedure pops the address from the top of the stack and jumps.

Programs on the simplePC don’t use local variables. Arguments to procedures are passed in registers, which are not saved and restored automatically. Therefore, the only values on the stack are return addresses.
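The convention can be mimicked with an explicit list as the stack, as in this small Python sketch (names assumed for illustration); the addresses match Ben’s stock-ticker program shown below, where the call at line 18 pushes 19, the address of the halt instruction:

    stack = []                          # holds return addresses only: no locals, no arguments

    def call(next_instruction, procedure_entry):
        stack.append(next_instruction)  # push the address of the instruction after the call
        return procedure_entry          # then jump to the procedure

    def ret():
        return stack.pop()              # pop the return address and jump to it

    pc = call(19, 1)                    # MAIN at line 18 calls READ_INPUT at line 1
    print(stack)                        # [19], the address of the halt instruction
    pc = ret()                          # pops 19 and jumps back to MAIN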
Ben develops a simple stock-ticker system to track the stocks of the start-up he joined. The program reads a message containing a single integer from the input device and prints it on the output device:
101. boolean input_available

1. procedure READ_INPUT ()
2.     do forever
3.         while input_available = FALSE do nothing // idle loop
4.         PRINT_MSG (quote)
5.         input_available ← FALSE

200. boolean output_done
201. structure output_buffer at 71FFF2 (hex) // hardware address of output buffer
202. integer quote

12. procedure PRINT_MSG (m)
13.     output_buffer.quote ← m
14.     while output_done = FALSE do nothing // idle loop
15.     output_done ← FALSE

17. procedure MAIN ()
18.     READ_INPUT ()
19.     halt // shutdown computer
In addition to the MAIN program, the program contains two procedures: READ_INPUT and PRINT_MSG. The procedure READ_INPUT spin-waits until input_available is set to TRUE by the input device (the stock reader). When the input device receives a stock quote, it places the quote value into msg and sets input_available to TRUE.

The procedure PRINT_MSG prints the message on an output device (a terminal in this case); it writes the value stored in the message to the device and waits until it is printed; the output device sets output_done to TRUE when it finishes printing.

The numbers on each line correspond to addresses as issued by the processor to read and write instructions and data. Assume that each line of pseudocode compiles into one machine instruction and that there is an implicit return at the end of each procedure.
Q 10.1 What do these numbers mentioned on each line of the program represent?
Ben runs the program directly on simplePC, starting in MAIN, and at some point he observes the following values on the stack (remember, only the stock-ticker program is running):

stack
19
5 ← stack pointer
Q 10.2 What is the meaning of the value 5 on the stack?
A. The return address for the next return instruction.
Q 10.3 Which procedure is being executed by the processor?
Q 10.4 PRINT_MSG writes a value to quote, which is stored at the address 71FFF2 (hex), with the expectation that the value will end up on the terminal. What technique is used to make this work?
Ben wants to run multiple instances of his stock-ticker program on the simplePC platform so that he can obtain more frequent updates to track more accurately his current net worth. Ben buys another input and output device for the system, hooks them up, and he implements a trivial thread manager:
300. integer threadtable[2] // stores stack pointers of threads;
     // first slot is threadtable[0]
302. integer current_thread initially 0

21. procedure YIELD ()
22.     threadtable[current_thread] ← SP // move value of SP into table
23.     current_thread ← (current_thread + 1) modulo 2
24.     SP ← threadtable[current_thread] // load value from table into SP
25.     return
Each thread reads from and writes to its own device and has its own stack. Ben also modifies READ_INPUT:
100. integer msg[2] // CHANGED to use array
102. boolean input_available[2] // CHANGED to use array

30. procedure READ_INPUT ()
31.     do forever
32.         while input_available[current_thread] = FALSE do // CHANGED
33.             YIELD () // CHANGED
34.             continue // CHANGED
35.         PRINT_MSG (msg[current_thread]) // CHANGED to use array
36.         input_available[current_thread] ← FALSE // CHANGED to use array
Ben powers up the simplePC platform and starts each thread running in MAIN. The two threads switch back and forth correctly. Ben stops the program temporarily and observes the following stacks:

stack of thread 0          stack of thread 1
19                         19
36 ← stack pointer         34 ← stack pointer
Q 10.5 Thread 0 was running (i.e., current_thread = 0). Which instruction will the processor be running after thread 0 executes the return instruction in YIELD the next time, and which thread will be running?
Q 10.6 What address values can be on the stack of each thread?
A. Addresses of any instruction.
B. Addresses to which called procedures return.
Ben observes that each thread in the stock-ticker program spends most of its time polling its input variable. He introduces an explicit procedure that the devices can use to notify the threads. He also rearranges the code for modularity:
400. integer state[2]

40. procedure SCHEDULE_AND_DISPATCH ()
41.     threadtable[current_thread] ← SP
42.     while (what should go here?) do // see question Q 10.7
43.         current_thread ← (current_thread + 1) modulo 2
45.     SP ← threadtable[current_thread]
46.     return

50. procedure YIELD ()
51.     state[current_thread] ← WAITING
52.     SCHEDULE_AND_DISPATCH ()
53.     return

60. procedure NOTIFY (n)
61.     state[n] ← RUNNABLE
62.     return
When the input device receives a new stock quote, the device interrupts the processor and saves the PC of the currently running thread on the currently running thread’s stack. Then the processor runs the interrupt procedure. When the interrupt handler returns, it pops the return address from the current stack, returning control to a thread. The pseudocode for the interrupt handler is:
procedure DEVICE (n) // interrupt for input device n
    push current thread’s PC on stack pointed to by SP
    while input_available[n] = TRUE do nothing // wait until READ_INPUT is done
                                               // with the last input
    msg[n] ← stock quote
    input_available[n] ← TRUE
    NOTIFY (n) // notify thread n
    return // i.e., pop PC
During the execution of the interrupt handler, interrupts are disabled. Thus, an interrupt handler and the procedures that it calls (e.g., NOTIFY) cannot be interrupted. Interrupts are reenabled when DEVICE returns. Using the new thread manager, answer the following questions:
Q 10.7 What expression should be evaluated in the while at address 42 to ensure correct operation of the thread package?
A. state[current_thread] = WAITING
B. state[current_thread] = RUNNABLE
Q 10.8 Assume thread 0 is running and thread 1 is not running (i.e., it has called YIELD). What event or events need to happen before thread 1 will run?
B. The interrupt procedure for input device 1 calls NOTIFY.
Q 10.9 What values can be on the stack of each thread?
A. Addresses of any instruction except those in the device driver interrupt procedure.
B. Addresses of all instructions, including those in the device driver interrupt procedure.
Q 10.10 Under which scenario can thread 0 deadlock?
A. When device 0 interrupts thread 0 just before the first instruction of YIELD.
B. When device 0 interrupts just after thread 0 completed the first instruction of YIELD.
C. When device 0 interrupts thread 0 between instructions 35 and 36 in the READ_INPUT procedure on page 454.
D. When device 0 interrupts when the processor is executing SCHEDULE_AND_DISPATCH and thread 0 is in the WAITING state.
(Chapter 5)
Ben Bitdiddle is so excited about Amazing Computer Company’s plans for a new segment-based computer architecture that he takes the job the company offered him.

Amazing Computer Company has observed that using one address space per program puts the text, data, stack, and system libraries in the same address space. For example, a Web server has the program text (i.e., the binary instructions) for the Web server, its internal data structures such as its cache of recently-accessed Web pages, the stack, and a system library for sending and receiving messages all in a single address space. Amazing Computer Company wants to explore how to enforce modularity even further by separating the text, data, stack, and system library using a new memory system.

The Amazing Computer Company has asked every designer in the company to come up with a design to enforce modularity further. In a dusty book about the PDP-11/70, Ben finds a description of a hardware gadget that sits between the processor and the physical memory, translating virtual addresses to physical addresses. The PDP-11/70 used that gadget to allow each program to have its own address space, starting at address 0.

The PDP-11/70 did this through having one segment per program. Conceptually, each segment is a variable-sized, linear array of bytes starting at virtual address 0. Ben bases his memory system on the PDP-11/70’s scheme with the intention of implementing hard modularity. Ben defines a segment through a segment descriptor:
structure segmentDescriptor
    physicalAddress physAddr
    integer length
The physAddr field records the address in physical memory where the segment is located. The length field records the length of the segment in bytes.

Ben’s processor has addresses consisting of 34 bits: 18 bits to identify a segment and 16 bits to identify the byte within the segment:
| segment ID | index   |
| 18 bits    | 16 bits |

A virtual address that addresses a byte outside a segment (i.e., an index greater than the length of the segment) is illegal.
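A short Python sketch (an illustration, not Ben’s hardware) of how the two fields would be pulled out of a 34-bit virtual address:

    SEGMENT_BITS, INDEX_BITS = 18, 16

    def split(addr):
        segment_id = addr >> INDEX_BITS           # the top 18 bits
        index = addr & ((1 << INDEX_BITS) - 1)    # the low 16 bits
        return segment_id, index

    addr = (3 << INDEX_BITS) | 0x0042             # byte 0x42 of segment 3
    print(split(addr))                            # prints (3, 66)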
Ben’s memory system stores the segment descriptors in a table, segmentTable, which has one entry for each segment:
structure segmentDescriptor segmentTable[NSEGMENT]
The segment table is indexed by segment_id. It is shared among all programs and stored at physical address 0.

The processor used by Ben’s computer is a simple RISC processor, which reads and writes memory using LOAD and STORE instructions. The LOAD and STORE instructions take a virtual address as their argument. Ben’s computer has enough memory that all programs fit in physical memory.

Ben ports a compiler that translates a source program to generate machine instructions for his processor. The compiler translates into a position-independent machine code: JUMP instructions specify an offset relative to the current value of the program counter. To make a call into another segment, it supports the LONGJUMP instruction, which takes a virtual address and jumps to it.
Ben’s memory system translates a virtual address to a physical address with TRANSLATE:

1 procedure TRANSLATE (addr)
2     segment_id ← addr[0:17]
3     segment ← segmentTable[segment_id]
4     index ← addr[18:33]
5     if index < segment.length then return segment.physAddr + index
6     … // What should the program do here? (see Q 11.4, below)
After successfully computing the physical address, Ben’s memory management unit retrieves the addressed data from physical memory and delivers it to the processor (on a LOAD instruction) or stores the data in physical memory (on a STORE instruction).
Q 11.1 What is the maximum sensible value of NSEGMENT?
Q 11.2 Given the structure of a virtual address, what is the maximum size of a segment in bytes?
Q 11.3 How many bits wide must a physical address be?
Q 11.4 The missing code on line 6 should
A. signal the processor that the instruction that issued the memory reference has caused an illegal address fault
B. signal the processor that it should change to user mode
D. signal the processor that the instruction that issues the memory reference is an interrupt handler
Ben modifies his Web server to enforce modularity between the different parts of the server. He allocates the text of the program in segment 1, a cache for recently used Web pages in segment 2, the stack in segment 3, and the system library in segment 4. Segment 4 contains the text of the library program but no variables (i.e., the library program doesn’t store variables in its own segment).

Q 11.5 To translate the Web server the compiler has to do which of the following?
A. Compute the physical address for each virtual address.
B. Include the appropriate segment ID in the virtual address used by a LOAD instruction.
C. Generate LONGJUMP instructions for calls to procedures located in different segments.
D. Include the appropriate segment ID in the virtual address used by a STORE instruction.
Ben runs the segment-based implementation of his Web server and to his surprise observes that errors in the Web server program can cause the text of the system library to be overwritten. He studies his design and realizes that the design is bad.

Q 11.6 What aspect of Ben’s design is bad and can cause the observed behavior?
A. A STORE instruction can overwrite the segment ID of an address.
B. A LONGJMP instruction in the Web server program may jump to an address in the library segment that is not the start of a procedure.
C. It doesn’t allow for paging of infrequently used memory to a secondary storage device.
Q 11.7 Which of the following extensions of Ben’s design would address each of the preceding problems?
A. The processor should have a protected user-mode bit, and there should be a separate segment table for kernel and user programs.
B. Each segment descriptor should have a protection bit, which specifies whether the processor can write or only read from this segment.
C. The LONGJMP instruction should be changed so that it can transfer control only to designated entry points of a segment.
D. Segments should all be the same size, just like pages in page-based virtual memory systems.
E. Change the operating system to use a preemptive scheduler.
The system library for Ben’s Web server contains code to send and receive messages. A separate program, the network manager, manages the network card that sends and receives messages. The Web server and the network manager each have one thread of execution. Ben wants to understand why he needs eventcounts for sequence coordination of the network manager and the Web server, so he decides to implement the coordination twice, once using eventcounts and the second time using event variables. Here are Ben’s two versions of the Web server:
Web server using eventcounts:

    eventcount inCnt
    integer doneCnt

    procedure SERVE ()
        do forever
            AWAIT (inCnt, doneCnt)
            DO_REQUEST ()
            doneCnt ← doneCnt + 1

Web server using events:

    event input
    integer inCnt
    integer doneCnt

    procedure SERVE ()
        do forever
            while inCnt ≤ doneCnt do // A
                WAITEVENT (input) // B
            DO_REQUEST () // C
            doneCnt ← doneCnt + 1
Both versions use a thread manager as described in Chapter 5, except for the changes to support eventcounts or events. The eventcount version is exactly the one described in Chapter 5. The AWAIT procedure has semantics for eventcounts: when the Web server thread calls AWAIT, the thread manager puts the calling thread into the WAITING state unless inCnt exceeds doneCnt.

The event-based version is almost identical to the eventcount one but has a few changes. An event variable is a list of threads waiting for the event. The procedure WAITEVENT puts the current executing thread on the list for the event, records that the current thread is in the WAITING state, and releases the processor by calling YIELD.

In both versions, when the Web server has completed processing a packet, it increases doneCnt.
The two corresponding versions of the code for handling each packet arrival in the network manager are:
Network manager using eventcounts:

    ADVANCE (inCnt)

Network manager using events:

    inCnt ← inCnt + 1 // D
    NOTIFYEVENT (input) // E
The ADVANCE procedure wakes up the Web server thread if it is already asleep. The NOTIFYEVENT procedure removes all threads from the list of the event and puts them into the READY state. The shared variables are stored in a segment shared between the network manager and the Web server.
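As a concrete model of these primitives, the sketch below (assumed structure, not the book’s code) renders WAITEVENT and NOTIFYEVENT in Python with plain lists and thread states. The essential property is that a notification reaches only threads already on the list:

    WAITING, READY = "WAITING", "READY"

    class Thread:
        def __init__(self):
            self.state = READY

    class Event:
        def __init__(self):
            self.waiters = []                  # threads currently waiting for this event

    def waitevent(event, current_thread):
        event.waiters.append(current_thread)   # put the caller on the event's list
        current_thread.state = WAITING         # record that it is waiting
        # ...then release the processor by calling YIELD

    def notifyevent(event):
        for t in event.waiters:                # wakes only threads already on the list;
            t.state = READY                    # a thread that has not yet waited gets nothing
        event.waiters.clear()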
Ben is a bit worried about writing code that involves coordinating multiple activities, so he decides to test the code carefully. He buys a computer with one processor to run both the Web server and the network manager using a preemptive thread scheduler. Ben ensures that the two threads (the Web server and the network manager) never run inside the thread manager at the same time by turning off interrupts when the processor is running the thread manager’s code (which includes ADVANCE, AWAIT, NOTIFYEVENT, and WAITEVENT).

To test the code, Ben changes the thread manager to preempt threads frequently (i.e., each thread runs with a short time slice). Ben runs the old code with eventcounts and the program behaves as expected, but the new code using events has the problem that the Web server sometimes delays processing a packet until the next packet arrives.
Q 11.8 The program steps that might be causing the problem are marked with letters in the code of the event-based solution above. Using those letters, give a sequence of steps that creates the problem. (Some steps might have to appear more than once, and some might not be necessary to create the problem.) 2002–1–4…11
(Chapter 5)
Using semaphores, down and up (see Sidebar 5.7), Ben implements an in-kernel bounded buffer as shown in the pseudocode below. The kernel maintains an array of port_infos. Each port_info contains a bounded buffer. The content of the message structure is not important for this problem, other than that it has a field dest_port, which specifies the destination port. When a message arrives from the network, it generates an interrupt, and the network interrupt handler (INTERRUPT) puts the message in the bounded buffer of the port specified in the message. If there is no space in that bounded buffer, the interrupt handler throws the message away. A thread consumes a message by calling RECEIVE_MESSAGE, which removes a message from the bounded buffer of the port it is receiving from.

To coordinate the interrupt handler and a thread calling RECEIVE_MESSAGE, the implementation uses a semaphore. For each port, the kernel keeps a semaphore n that counts the number of messages in the port’s bounded buffer. If n reaches 0, the thread calling DOWN in RECEIVE_MESSAGE will enter the WAITING state. When INTERRUPT adds a message to the buffer, it calls UP on n, which will wake up the thread (i.e., set the thread’s state to RUNNABLE).
structure port_info
    semaphore instance n initially 0
    message instance buffer[NMSG] // an array of messages
    long integer in initially 0
    long integer out initially 0

procedure INTERRUPT (message instance m, port_info reference port)
    // an interrupt announcing the arrival of message m
    if port.in − port.out ≥ NMSG then // is there space?
        return // no, ignore message
    port.buffer[port.in modulo NMSG] ← m
    port.in ← port.in + 1
    UP (port.n)

procedure RECEIVE_MESSAGE (port_info reference port)
1   … // another line of code will go here
    DOWN (port.n)
    m ← port.buffer[port.out modulo NMSG]
    port.out ← port.out + 1
    return m
The kernel schedules threads preemptively.
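For comparison, here is a user-space analogue of the same bounded buffer written with Python’s threading.Semaphore; it is a sketch under the same rules as the kernel version: a single producer that drops messages when the buffer is full, and a consumer that is the only party that ever blocks.

    import threading

    NMSG = 8

    class Port:
        def __init__(self):
            self.n = threading.Semaphore(0)    # counts messages in the bounded buffer
            self.buffer = [None] * NMSG
            self.inp = 0                       # 'in' is a reserved word in Python
            self.out = 0

    def interrupt(port, m):                    # the producer, like INTERRUPT above
        if port.inp - port.out >= NMSG:        # is there space?
            return                             # no, ignore the message
        port.buffer[port.inp % NMSG] = m
        port.inp += 1
        port.n.release()                       # UP (port.n)

    def receive_message(port):                 # the consumer
        port.n.acquire()                       # DOWN (port.n): wait for a message
        m = port.buffer[port.out % NMSG]
        port.out += 1
        return m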
Q 12.1 Assume that there are no concurrent invocations of INTERRUPT and that there are no concurrent invocations of RECEIVE_MESSAGE on the same port. Which of the following statements is true about the implementation of INTERRUPT and RECEIVE_MESSAGE?
A. There are no race conditions between two threads that invoke RECEIVE_MESSAGE concurrently on different ports.
B. The complete execution of UP in INTERRUPT will not be interleaved between the statements labeled 15 and 16 in DOWN in Sidebar 5.7.
C. Because DOWN and UP are atomic, the processor instructions necessary for decrementing sem in DOWN and incrementing sem in UP will not be interleaved incorrectly.
D. Because in and out may be shared between the interrupt handler running INTERRUPT and a thread calling RECEIVE_MESSAGE on the same port, it is possible for INTERRUPT to throw away a message, even though there is space in the bounded buffer.
Alyssa claims that semaphores can also be used to make operations atomic. She proposes the following addition to a port_info structure:

semaphore instance mutex initially ???? // see question below

and adds the following line to RECEIVE_MESSAGE, at line 1 in the pseudocode above:

DOWN (port.mutex) // enter atomic section

Alyssa argues that these changes allow threads to concurrently invoke RECEIVE_MESSAGE on the same port without race conditions, even if the kernel schedules threads preemptively.
Q 12.2 To what value can mutex be initialized (by replacing ???? with a number in the semaphore declaration) to avoid race conditions and deadlocks when multiple threads call RECEIVE_MESSAGE on the same port?
(Chapter 5)
Ben Bitdiddle plans to create a revolution in computing with his just-developed $15 single-chip Network Computer, NC. In the NC network system, the network interface thread calls the procedure MESSAGE_ARRIVED when a message arrives. The procedure WAIT_FOR_MESSAGE can be called by a thread to wait for a message. To coordinate the sequences in which threads execute, Ben deploys another commonly used coordination primitive: condition variables.

Part of the code in the NC is as follows:
1  lock instance m
2  boolean message_here
3  condition instance message_present
4
5  procedure MESSAGE_ARRIVED ()
6      message_here ← TRUE
7      NOTIFY_CONDITION (message_present) // notify threads waiting on this condition
8
9  procedure WAIT_FOR_MESSAGE ()
10     ACQUIRE (m)
11     while not message_here do
12         WAIT_CONDITION (message_present, m) // release m and wait
13     RELEASE (m)
The procedures ACQUIRE and RELEASE are the ones described in Chapter 5. NOTIFY_CONDITION (condition) atomically wakes up all threads waiting for condition to become TRUE. WAIT_CONDITION (condition, lock) does several things atomically: it tests condition; if TRUE it returns; otherwise it puts the calling thread on the waiting queue for condition and releases lock. When NOTIFY_CONDITION wakens a thread, that thread becomes runnable, and when the scheduler runs that thread, WAIT_CONDITION reacquires lock (waiting, if necessary, until it is available) before returning to its caller.

Assume there are no errors in the implementation of condition variables.
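Python’s threading.Condition differs in detail from these primitives (its wait does not test the predicate for you), but the conventional pattern it supports can be sketched as follows, with the predicate tested, waited on, and updated under the same lock:

    import threading

    m = threading.Condition()          # bundles a lock with a condition variable
    message_here = False

    def message_arrived():
        global message_here
        with m:                        # update the predicate while holding the lock
            message_here = True
            m.notify_all()             # wake every thread waiting on the condition

    def wait_for_message():
        with m:
            while not message_here:    # re-test the predicate after every wakeup
                m.wait()               # atomically releases m and waits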
Q 13.1 It is possible that WAIT_FOR_MESSAGE will wait forever even if a message arrives while it is spinning in the while loop. Give an execution ordering of the above statements that would cause this problem. Your answer should be a simple list such as 1, 2, 3, 4.

Q 13.2 Write new version(s) of MESSAGE_ARRIVED and/or WAIT_FOR_MESSAGE to fix this problem. 1998–1–3a/b
(Chapters 5 and 7 [on-line])
Louis P. Hacker bought a used Therac-25 (the medical irradiation machine that was involved in several fatal accidents—see Suggestions for Further Reading 1.9.5) for $14.99 at a yard sale. After some slight modifications, he has hooked it up to his home network as a computer-controllable turbo-toaster, which can toast one slice in under 2 milliseconds. He decides to use RPC to control the Toastac-25. Each toasting request starts a new thread on the server, which cooks the toast, returns an acknowledgment (or perhaps a helpful error code, such as “Malfunction 54”), and exits. Each server thread runs the following procedure:
procedure SERVER ()
    ACQUIRE (message_buffer_lock)
    DECODE (message)
    ACQUIRE (accelerator_buffer_lock)
    RELEASE (message_buffer_lock)
    COOK_TOAST ()
    ACQUIRE (message_buffer_lock)
    message ← “ack”
    SEND (message)
    RELEASE (accelerator_buffer_lock)
    RELEASE (message_buffer_lock)
Q 14.1 To his surprise, the toaster stops cooking toast the first time it is heavily used! What has gone wrong?
A. Two server threads might deadlock because one has message_buffer_lock and wants accelerator_buffer_lock, while the other has accelerator_buffer_lock and wants message_buffer_lock.
B. Two server threads might deadlock because one has accelerator_buffer_lock and message_buffer_lock.
C. Toastac-25 deadlocks because COOK_TOAST is not an atomic operation.
D. Insufficient locking allows inappropriate interleaving of server threads.
Once Louis fixes the multithreaded server, the Toastac gets more use than ever. However, when the Toastac has many simultaneous requests (i.e., there are many threads), he notices that the system performance degrades badly—much more than he expected. Performance analysis shows that competition for locks is not the problem.
Q 14.2 What is probably going wrong?
A. The Toastac system spends all its time context switching between threads.
B. The Toastac system spends all its time waiting for requests to arrive.
C. The Toastac gets hot, and therefore cooking toast takes longer.
Q 14.3 An upgrade to a supercomputer fixes that problem, but it’s too late—Louis is obsessed with performance. He switches from RPC to an asynchronous protocol, which groups several requests into a single message if they are made within 2 milliseconds of one another. On his network, which has a very high transit time, he notices that this speeds up some workloads far more than others. Describe a workload that is sped up and a workload that is not sped up. (An example of a possible workload would be one request every 10 milliseconds.)

Q 14.4 As a design engineering consultant, you are called in to critique Louis’s decision to move from RPC to asynchronous client/service. How do you feel about his decision? Remember that the Toastac software sometimes fails with a “Malfunction 54” instead of toasting properly. 1996–1–5c/d & 1999–1–12/13
(Chapters 5 and 6)
Ben Bitdiddle writes a large number of object-oriented programs. Objects come in different sizes, but pages come in a fixed size. Ben is inspired to redesign his page-based virtual memory system (PAGE) into an object memory system. PAGE is a page-based virtual memory system like the one described in Chapter 5 with the extensions for multilevel memory systems from Chapter 6. BOOZE is Ben’s object-based virtual memory system.* Of course, he can run his programs on either system.
Each BOOZE object has a unique ID called a UID. A UID has three fields: a disk address for the disk block that contains the object; an offset within that disk block where the object starts; and the size of the object.
structure uid
    integer blocknr // disk address for disk block
    integer offset // offset within block blocknr
    integer size // size of object
Applications running on BOOZE and PAGE have similar structure. The only difference is that on PAGE, programs refer to objects by their virtual address, while on BOOZE programs refer to objects by UIDs.
The two levels of memory in BOOZE and PAGE are main memory and disk. The disk is a linear array of fixed-size blocks of 4 kilobytes. A disk block is addressed by its block number. In both systems, the transfer unit between the disk and main memory is a 4-kilobyte block. Objects don’t cross disk block boundaries, are smaller than 4 kilobytes, and cannot change size. The page size in PAGE is equal to the disk block size; therefore, when an application refers to an object, PAGE will bring in all objects on the same page.

BOOZE keeps an object map in main memory. The object map contains entries that map a UID to the memory address of the corresponding object.
structure mapentry
    uid instance UID
    integer addr
On all references to an object, BOOZE translates a UID to an address in main memory. BOOZE uses the following procedure (implemented partly in hardware and partly in software) for translation:
procedure OBJECTTOADDRESS (UID) returns address
    addr ← ISPRESENT (UID) // is UID present in object map?
    if addr ≥ 0 then return addr // UID is present, return addr
    addr ← FINDFREESPACE (UID.size) // allocate space to hold object
    READOBJECT (addr, UID) // read object from disk & store at addr
    ENTERINTOMAP (UID, addr) // enter UID in object map
    return addr // return memory address of object
ISPRESENT looks up UID in the object map; if present, it returns the address of the corresponding object; otherwise, it returns −1. FINDFREESPACE allocates free space for the object; it might evict another object to make space available for this one. READOBJECT reads the page that contains the object, and then copies the object to the allocated address.
Q 15.1 What does addr in the mapentry data structure denote?
A. The memory address at which the object map is located.
B. The disk address at which to find a given object.
C. The memory address at which to find a given object that is currently resident in memory.
D. The memory address at which a given non-resident object would have to be loaded, when an access is made to it.
Q 15.2 In what way is BOOZE better than PAGE?
A. Applications running on BOOZE generally use less main memory because BOOZE stores only objects that are in use.
B. Applications running on BOOZE generally run faster because UIDs are smaller than virtual addresses.
C. Applications running on BOOZE generally run faster because BOOZE transfers objects from disk to main memory instead of complete pages.
D. Applications running on BOOZE generally run faster because typical applications will exhibit better locality of reference.
When FINDFREESPACE cannot find enough space to hold the object, it needs to write one or more objects back to the disk to create free space. FINDFREESPACE uses WRITEOBJECT to write an object to the disk. Ben is figuring out how to implement WRITEOBJECT. He is considering the following options:
1. procedure WRITEOBJECT(addr, UID)
       WRITE(addr, UID.blocknr, 4096)

2. procedure WRITEOBJECT(addr, UID)
       READ(buffer, UID.blocknr, 4096)
       COPY(addr, buffer + UID.offset, UID.size)
       WRITE(buffer, UID.blocknr, 4096)
READ (mem_addr, disk_addr, 4096) and WRITE (mem_addr, disk_addr, 4096) read and write a 4-kilobyte page from/to the disk. COPY (source, destination, size) copies size bytes from a source address to a destination address in main memory.
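To see what is at stake between the two options, here is a hedged Python sketch over a disk modeled as a table of 4096-byte blocks; memory, disk, and the helper names are assumptions of the sketch, not the book's interfaces. Option 1 writes 4096 bytes starting at the object's memory address, so the object lands at offset 0 of the block instead of at UID.offset and the rest of the block is overwritten with unrelated memory; option 2 does a read-modify-write that preserves the block's other objects.

from collections import namedtuple

UID = namedtuple("UID", "blocknr offset size")
BLOCK = 4096

def write_object_v1(memory, disk, addr, uid):
    # Option 1: copy 4096 bytes starting at the object's address. The bytes
    # beyond uid.size are whatever happens to follow the object in main
    # memory, so the other objects stored in that disk block are clobbered.
    disk[uid.blocknr] = bytes(memory[addr:addr + BLOCK])

def write_object_v2(memory, disk, addr, uid):
    # Option 2: read-modify-write, preserving the rest of the block.
    buffer = bytearray(disk[uid.blocknr])                                    # READ
    buffer[uid.offset:uid.offset + uid.size] = memory[addr:addr + uid.size]  # COPY
    disk[uid.blocknr] = bytes(buffer)                                        # WRITE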
Q 15.3 Which implementation should Ben use?
A. Implementation 2, since implementation 1 is incorrect.
B. Implementation 1, since it is more efficient than implementation 2.
C. Implementation 1, since it is easier to understand.
D. Implementation 2, since it will result in better locality of reference.
Ben now turns his attention to optimizing the performance of BOOZE. In particular, he wants to reduce the number of writes to the disk.
Q 15.4 Which of the following techniques will reduce the number of writes without losing correctness?
A. Prefetching objects on a read.
B. Delaying writes to disk until the application finishes its computation.
C. Writing to disk only objects that have been modified.
D. Delaying a write of an object to disk until it is accessed again.
Ben decides that he wants even better performance, so he decides to modify FINDFREESPACE. When FINDFREESPACE has to evict an object, it now tries not to write an object modified in the last 30 seconds (in the belief that it may be used again soon). Ben does this by setting the dirty flag when the object is modified. Every 30 seconds, BOOZE calls a procedure WRITE_BEHIND that walks through the object map and writes out all objects that are dirty. After an object has been written, WRITE_BEHIND clears its dirty flag. When FINDFREESPACE needs to evict an object to make space for another, clean objects are the only candidates for replacement.
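A minimal sketch of Ben's scheme, assuming a Python dictionary for the object map and a write_object helper supplied by the caller (both assumptions of this illustration); the 30-second loop would run in a thread of its own:

import time

class MapEntry:
    def __init__(self, addr):
        self.addr = addr
        self.dirty = False       # set whenever the object is modified

def write_behind(object_map, write_object):
    # Every 30 seconds, write all dirty objects to disk and clean their flags.
    while True:
        time.sleep(30)
        for uid, entry in list(object_map.items()):
            if entry.dirty:
                write_object(entry.addr, uid)
                entry.dirty = False

def pick_victim(object_map):
    # FINDFREESPACE considers only clean objects for replacement...
    for uid, entry in object_map.items():
        if not entry.dirty:
            return uid
    return None                  # ...so it can come up empty-handed (see Q 15.5)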
When running his applications on the latest version of BOOZE, Ben observes that once in a while BOOZE runs out of physical memory when calling OBJECTTOADDRESS for a new object.
Q 15.5 Which of these strategies avoids the above problem?
A. When FINDFREESPACE cannot find any clean objects, it calls WRITE_BEHIND and then tries to find clean objects again.
B. BOOZE could call WRITE_BEHIND every second instead of every 30 seconds.
C. When FINDFREESPACE cannot find any clean objects, it picks one dirty object, writes the block containing the object to the disk, clears the dirty flag, and then uses that address for the new object.
(Chapter 6, with a bit of Chapter 4)
OutOfMoney.com has decided it needs a real product, so it is laying off most of its Marketing Department. To replace the marketing folks, and on the advice of a senior computer expert, OutOfMoney.com hires a crew of 16-year-olds. The 16-year-olds get together and decide to design and implement a video service that serves MPEG-1 video, so that they can watch Britney Spears on their computers in living color.
Since time to market is crucial, Mark Bitdiddle—Ben’s 16-year-old kid brother, who is working for OutOfMoney—surfs the Web to find some code from which they can start. Mark finds some code that looks relevant, and he modifies it for OutOfMoney’s video service:
procedure SERVICE()
    do forever
        request ← RECEIVE_MESSAGE()
        file ← GET_FILE_FROM_DISK(request)
        REPLY(file)
The SERVICE procedure waits for a message from a client to arrive on the network. The message contains a request for a particular file. The procedure GET_FILE_FROM_DISK reads the file from disk into the memory location file. The procedure REPLY sends the file from memory in a message back to the client.
(In the pseudocode, undeclared variables are local variables of the procedure in which they are used, and the variables are thus stored on the stack or in registers.)
Mark and his 16-year-old buddies also write code for a network driver to SEND and RECEIVE network packets, a simple file system to PUT and GET files on a disk, and a loader for booting a machine. They run their code on the bare hardware of an off-the-shelf personal computer with one disk, one processor (a Pentium III), and one network interface card (1 gigabit per second Ethernet). After the machine has booted, it starts one thread running SERVICE.
The disk has an average seek time of 5 milliseconds, a complete rotation takes 6 milliseconds, and its throughput is 10 megabytes per second when no seeks are required.
All files are 1 gigabyte (roughly a half hour of MPEG-1 video). The file system in which the files are stored has no cache, and it allocates data for a file in 8-kilobyte chunks. It pays no attention to file layout when allocating a chunk; as a result, disk blocks of the same file can be all over the disk. A 1-gigabyte file contains 131,072 8-kilobyte blocks.
Q 16.1 Assuming that the disk is the main bottleneck, how long does the service take to serve a file?
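One back-of-the-envelope reading of these numbers, assuming the average rotational delay is half a rotation, that 10 megabytes means 10^7 bytes, and that every 8-kilobyte block costs a full seek because the file system scatters blocks across the disk:

seek = 5e-3                 # average seek time, seconds
rotation = 6e-3 / 2         # assumed average rotational delay: half a rotation
transfer = 8192 / 10e6      # 8 kilobytes at 10 megabytes per second, ~0.8 ms
per_block = seek + rotation + transfer    # roughly 8.8 ms per block
print(per_block * 131072)   # ~1,156 seconds, i.e., roughly 19 minutes per file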
Mark is shocked about the performance. Ben suggests that they should add a cache. Mark, impressed by Ben’s knowledge, follows his advice and adds a 1-gigabyte cache, which can hold one file completely:
cache[1073741824]    // 1-gigabyte cache

procedure SERVICE()
    do forever
        request ← RECEIVE_MESSAGE()
        file ← LOOK_IN_CACHE(request)
        if file = NULL then
            file ← GET_FILE_FROM_DISK(request)
            ADD_TO_CACHE(request, file)
        REPLY(file)
The procedure LOOK_IN_CACHE checks whether the file specified in the request is present in the cache and returns it if present. The procedure ADD_TO_CACHE copies a file to the cache.
Q 16.2 Mark tests the code by asking once for every video stored. Assuming that the disk is the main bottleneck (serving a file from the cache takes 0 milliseconds), what now is the average time for the service to serve a file?
Mark is happy that the test actually returns every video. He reports back to the only person left in the Marketing Department that the prototype is ready to be evaluated. To keep the investors happy, the marketing person decides to use the prototype to run OutOfMoney’s Web site. The one-person Marketing Department loads the machine up with videos and launches the new Web site with a big PR campaign, blowing their remaining funding.
Seconds after they launch the Web site, OutOfMoney’s support organization (also staffed by 16-year-olds) receives e-mail from unhappy users saying that the service is not responding to their requests. The support department measures the load on the service CPU and also the service disk. They observe that the CPU load is low and the disk load is high.
Q 16.3 What is the most likely reason for this observation?
The support department beeps Mark, who runs to his brother Ben for help. Ben suggests using the example thread package of Chapter 5. Mark augments the code to use the thread package and after the system boots, it starts 100 threads, each running SERVICE:
for i from 1 to 100 do CREATE_THREAD (SERVICE)
In addition, Mark modifies RECEIVE_MESSAGE and GET_FILE_FROM_DISK to release the processor by calling YIELD when waiting for a new message to arrive or waiting for the disk to complete a disk read. In no other place does his code release the processor. The implementation of the thread package is non-preemptive.
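The following toy sketch (Python generators, not the book's thread package; the names service and scheduler are inventions of the sketch) illustrates the non-preemptive discipline: a thread keeps the processor until it reaches a yield point of its own choosing.

def service(name, requests):
    # A service thread as a generator: it runs until it chooses to yield.
    while True:
        while not requests:      # waiting for a message...
            yield                # ...is the only place the processor is released
        request = requests.pop(0)
        print(name, "serving", request)

def scheduler(threads, steps):
    # Non-preemptive round robin: each thread keeps the processor
    # until it reaches one of its own yield points.
    for _ in range(steps):
        for t in threads:
            next(t)

requests = ["video-1", "video-2"]
threads = [service("thread-%d" % i, requests) for i in range(3)]
scheduler(threads, 2)

Running it shows thread-0 serving both queued requests before any other thread gets the processor, which is exactly the behavior a non-preemptive package permits.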
To take advantage of the threaded implementation, Mark modifies the code to read blocks of a file instead of complete files. He also runs to the store and buys some more memory so he can increase the cache size to 4 gigabytes. Here is his latest effort:
cache[4 × 1073741824]    // the 4-gigabyte cache, shared by all threads

procedure SERVICE()
    do forever
        request ← RECEIVE_MESSAGE()
        file ← NULL
        for k from 1 to 131072 do
            block ← LOOK_IN_CACHE(request, k)
            if block = NULL then
                block ← GET_BLOCK_FROM_DISK(request, k)
                ADD_TO_CACHE(request, block, k)
            file ← file + block    // + concatenates strings
        REPLY(file)
The procedure LOOK_IN_CACHE (request, k) checks whether block k of the file specified in request is present; if the block is present, it returns it. The procedure GET_BLOCK_FROM_DISK reads block k of the file specified in request from the disk into memory. The procedure ADD_TO_CACHE adds block k from the file specified in request to the cache.
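For concreteness, here is one plausible shape for this shared block cache in Python; the bounded capacity and the FIFO eviction are assumptions of the sketch, not Mark's code.

CAPACITY = (4 * 2**30) // 8192    # 4-gigabyte cache measured in 8-kilobyte blocks

cache = {}    # (request, k) -> block; shared by all service threads
order = []    # insertion order, used here for a simple FIFO eviction

def look_in_cache(request, k):
    return cache.get((request, k))    # None plays the role of NULL

def add_to_cache(request, block, k):
    if len(cache) >= CAPACITY:        # cache full: evict some block;
        cache.pop(order.pop(0))       # FIFO here, but see Q 16.7 below
    cache[(request, k)] = block
    order.append((request, k))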
Mark loads up the service with one video. He retrieves the video successfully. Happy with this result, Mark sends many requests for the single video in parallel to the service. He observes no disk activity.
Q 16.4 Based on the information so far, what is the most likely explanation why Mark observes no disk activity?
Happy with the progress, Mark makes the service ready for running in production mode. He is worried that he may have to modify the code to deal with concurrency—his past experience has suggested to him that he needs an education, so he is reading Chapter 5. He considers protecting ADD_TO_CACHE with a lock:
lock instance cachelock    // a lock for the cache

procedure SERVICE()
    do forever
        request ← RECEIVE_MESSAGE()
        file ← NULL
        for k from 1 to 131072 do
            block ← LOOK_IN_CACHE(request, k)
            if block = NULL then
                block ← GET_BLOCK_FROM_DISK(request, k)
                ACQUIRE(cachelock)    // use the lock
                ADD_TO_CACHE(request, block, k)
                RELEASE(cachelock)    // here, too
            file ← file + block
        REPLY(file)
Q 16.5 Ben argues that these modifications are not useful. Is Ben right?
Mark doesn’t like thinking, so he upgrades OutOfMoney’s Web site to use the multithreaded code with locks. When the upgraded Web site goes live, Mark observes that most users watch the same three videos, while a few are watching other videos.
Q 16.6 Mark observes a hit-ratio of 90% for blocks in the cache. Assuming that the disk is the main bottleneck (serving blocks from the cache takes 0 milliseconds), what is the average time for SERVICE to serve a single movie?
Q 16.7 Mark loads a new Britney Spears video onto the service and observes operation as the first users start to view it. It is so popular that no users are viewing any other video. Mark sees that the first batch of viewers all start watching the video at about the same time. He observes that the service threads all read block 0 at about the same time, then all read block 1 at about the same time, and so on. For this workload what is a good cache replacement policy?
The Marketing Department is extremely happy with the progress. Ben raises another round of money by selling his BMW and launches another PR campaign. The number of users dramatically increases. Unfortunately, under high load the machine stops serving requests and has to be restarted. As a result, some users have to restart their videos from the beginning, and they call up the support department to complain. The problem appears to be some interaction between the network driver and the service threads. The driver and service threads share a fixed-size input buffer that can hold 1,000 request messages. If the buffer is full and a message arrives, the driver drops the message. When the card receives data from the network, it issues an interrupt to the processor. This interrupt causes the network driver to run immediately on the stack of the currently running thread. The code for the driver and RECEIVE_MESSAGE is as follows:
buffer[1000]
lock instance bufferlock

procedure DRIVER()
    message ← READ_FROM_INTERFACE()
    ACQUIRE(bufferlock)
    if SPACE_IN_BUFFER() then ADD_TO_BUFFER(message)
    else DISCARD_MESSAGE(message)
    RELEASE(bufferlock)

procedure RECEIVE_MESSAGE()
    while BUFFER_IS_EMPTY() do YIELD()
    ACQUIRE(bufferlock)
    message ← REMOVE_FROM_BUFFER()
    RELEASE(bufferlock)
    return message

procedure INTERRUPT()
    DRIVER()
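To make the hazard concrete, here is an illustrative single-processor sketch in Python (every name in it is an invention of the sketch): the interrupt runs DRIVER on the stack of the current thread, so if that thread already holds bufferlock inside RECEIVE_MESSAGE, DRIVER spins on a lock whose holder can never run again.

bufferlock_held = False      # a simple spin lock, as on one processor

def acquire():
    global bufferlock_held
    while bufferlock_held:   # spins; on a single processor nothing else can
        pass                 # run until this code returns, so if the
    bufferlock_held = True   # interrupted thread holds the lock, this loops forever

def release():
    global bufferlock_held
    bufferlock_held = False

def driver():                # runs at interrupt time, on the current stack
    acquire()                # deadlocks if the interrupt arrived while the
    # ... ADD_TO_BUFFER ...  # interrupted thread was inside its critical section
    release()

def receive_message():
    acquire()
    driver()                 # stands in for an interrupt arriving right here,
                             # while this thread still holds bufferlock
    release()

# receive_message()          # would never return: driver() spins on bufferlock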
Q 16.8 Which of the following could happen under high load?
A. Deadlock when an arriving message interrupts DRIVER.
B. Deadlock when an arriving message interrupts a thread that is in RECEIVE_MESSAGE.
C. Deadlock when an arriving message interrupts a thread that is in REMOVE_FROM_BUFFER.
D. RECEIVE_MESSAGE misses a call to YIELD when the buffer is not empty, because it can be interrupted between the BUFFER_IS_EMPTY test and the call to YIELD.
Q 16.9 What fixes should Mark implement?
A. Delete all the code dealing with locks.
B. DRIVER should run as a separate thread, to be awakened by the interrupt.
C. INTERRUPT and DRIVER should use an eventcount for sequence coordination.
Mark eliminates the deadlock problems and, to attract more users, announces the availability of a new Britney Spears video. The news spreads rapidly, and an enormous number of requests for this one video start hitting the service. Mark measures the throughput of the service as more and more clients ask for the video. The resulting graph is plotted below. The throughput first increases while the number of clients increases, then reaches a maximum value, and finally drops off.
Q 16.10 Why does the throughput decrease with a large number of clients?
* Credit for developing this problem set goes to Lewis D. Girod.
* Credit for developing this problem set goes to Samuel R. Madden.
* Credit for developing this problem set goes to Stephen A. Ward.
* Credit for developing this problem set goes to Hari Balakrishnan.
* Credit for developing this problem set goes to Stephen A. Ward.
* Credit for developing this problem set goes to Robert T. Morris.
* Credit for developing this problem set goes to Samuel R. Madden.
* Brian N. Bershad, David D. Redell, and John R. Ellis. Fast mutual exclusion for uniprocessors. Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (October 1992), pages 223–233.
* Credit for developing this problem set goes to David K. Gifford.
* Credit for developing this problem set goes to Eddie Kohler.
* Ben chose this name after reading a paper by Ted Kaehler, “Virtual memory for an object-oriented language” [Suggestions for Further Reading 6.1.4]. In that paper, Kaehler describes a memory management system called the Object-Oriented Zoned Environment, with the acronym OOZE.
Glossary
Abort Upon deciding that an all-or-nothing action cannot or should not commit, to undo all of the changes previously made by that all-or-nothing action. After aborting, the state of the system, as viewed by anyone above the layer that implements the all-or-nothing action, is as if the all-or-nothing action never existed. Compare with commit. [Ch. 9]
Absolute path name In a naming hierarchy, a path name that a name resolver resolves by using a universal context known as the root context. [Ch. 2]
Abstraction The separation of the interface specification of a module from its internal implementation so that one can understand and make use of that module with no need to know how it is implemented internally. [Ch. 1]
Access control list (ACL) A list of principals authorized to have access to some object. [Ch. 11]
Acknowledgment (ACK) A status report from the recipient of a communication to the originator. Depending on the protocol, an acknowledgment may imply or explicitly state any of several things—for example, that the communication was received, that its checksum verified correctly, that delivery to a higher level was successful, or that buffer space is available for another communication. Compare with negative acknowledgment. [Ch. 2]
Action An operation performed by an interpreter. Examples include a microcode step, a machine instruction, a higher-level language instruction, a procedure invocation, a shell command line, a response to a gesture at a graphical interface, or a database update. [Ch. 9]
Active fault A fault that is currently causing an error. Compare with latent fault. [Ch. 8]
Adaptive routing A method for setting up forwarding tables so that they change automatically when links are added to and deleted from the network or when congestion makes a path less desirable. Compare with static routing. [Ch. 7]
Address A name that is overloaded with information useful for locating the named object. In a computer system, an address is usually of fixed length and resolved by hardware into a physical location by mapping to geometric coordinates. Examples of addresses include the names for a byte of memory and for a disk track. Also see network address. [Ch. 2]
Address resolution protocol (ARP) A protocol used when a broadcast network is a component of a packet-forwarding network. The protocol dynamically constructs tables that map station identifiers of the broadcast network to network attachment point identifiers of the packet-forwarding network. [Ch. 7]
Address space The name space of a location-addressed memory, usually a set of contiguous integers (0, 1, 2,…). [Ch. 2]
Adversary An entity that intentionally tries to defeat the security measures of a computer system. The entity may be malicious, out for profit, or just a hacker. A friendly adversary is one that tests the security of a computer system. [Ch. 11]
Advertise In a network-layer routing protocol, for a participant to tell other participants which network addresses it knows how to reach. [Ch. 7]
Alias One of multiple names that map to the same value; another term for synonym. (Beware: some operating systems define alias to mean an indirect name.) [Ch. 2]
All-or-nothing atomicity A property of a multistep action that if an anticipated failure occurs during the steps of the action, the effect of the action from the point of view of its invoker is either never to have started or else to have been accomplished completely. Compare with before-or-after atomicity and atomic. [Ch. 9]
Any-to-any connection A desirable property of a communication network, that any node be able to communicate with any other. [Ch. 7]
Archive A record, usually kept in the form of a log, of old data values, for auditing, recovery from application mistakes, or historical interest. [Ch. 9]
Asynchronous (From Greek roots meaning “not timed”) 1. Describes concurrent activities that are not coordinated by a common clock and thus may make progress at different rates. For example, multiple processors are usually asynchronous, and I/O operations are typically performed by an I/O channel processor that is asynchronous with respect to the processor that initiated the I/O. [Ch. 2] 2. In a communication network, describes a communication link over which data is sent in frames whose timing relative to other frames is unpredictable and whose lengths may not be uniform. Compare with isochronous. [Ch. 7]
At-least-once A protocol assurance that the intended operation or message delivery was performed at least one time. It may have been performed several times. [Ch. 4]
At-most-once A protocol assurance that the intended operation or message delivery was performed no more than one time. It may not have been performed at all. [Ch. 4]
Atomic (adj.); Atomicity (n.) A property of a multistep action that there be no evidence that it is composite above the layer that implements it. An atomic action can be before-or-after, which means that its effect is as if it occurred either completely before or completely after any other before-or-after action. An atomic action can also be all-or-nothing, which means that if an anticipated failure occurs during the action, the effect of the action as seen by higher layers is either never to have started or else to have completed successfully. An atomic action that is both all-or-nothing and before-or-after is known as a transaction. [Ch. 9]
Atomic storage Cell storage for which a multicell PUT can have only two possible outcomes: (1) it stores all data successfully, or (2) it does not change the previous data at all. In consequence, either a concurrent thread or (following a failure) a later thread doing a GET will always read either all old data or all new data. Computer architectures in which multicell PUTs are not atomic are said to be subject to write tearing. [Ch. 9]
Authentication Verifying the identity of a principal or the authenticity of a message. [Ch. 11]
Authentication tag A cryptographically computed string, associated with a message, that allows a receiver to verify the authenticity of the message. [Ch. 11]
Automatic rate adaptation A technique by which a sender automatically adjusts the rate at which it introduces packets into a network to match the maximum rate that the narrowest bottleneck can handle. [Ch. 7]
Authorization A decision made by an authority to grant a principal permission to perform some operation, such as reading certain information. [Ch. 11]
Availability A measure of the time that a system was actually usable, as a fraction of the time that it was intended to be usable. Compare with its complement, down time. [Ch. 8]
Backup copy Of a set of replicas that is not written or updated synchronously, one that is written later. Compare with primary copy and mirror. [Ch. 10]
Backward error correction A technique for correcting errors in which the source of the data or control signal applies enough redundancy to allow errors to be detected and, if an error does occur, that source is asked to redo the calculation or repeat the transmission. Compare with forward error correction. [Ch. 8]
Bad-news diode An undesirable tendency of people in organizations that design and implement systems: good news, for example, that a module is ready for delivery ahead of schedule, tends to be passed immediately throughout the organization, but bad news, for example, that a module did not pass its acceptance tests, tends to be held locally until either the problem can be fixed or it cannot be concealed any longer. [Ch. 1]
Bandwidth A measure of analog spectrum space for a communication channel. The bandwidth, the acceptable signal power, and the noise level of a channel together determine the maximum possible data rate for that channel. In digital systems, this term is so often misused as a synonym for maximum data rate that it has now entered the vocabulary of digital designers with that additional meaning. Analog engineers, however, still cringe at that usage. [Ch. 7]
Batching A technique to improve performance by combining several operations into a single operation to reduce setup overhead. [Ch. 6]
Before-or-after atomicity A property of concurrent actions: Concurrent actions are before-or-after actions if their effect from the point of view of their invokers is the same as if the actions occurred either completely before or completely after one another. One consequence is that concurrent before-or-after software actions cannot discover the composite nature of one another (that is, one action cannot tell that another has multiple steps). A consequence in the case of hardware is that concurrent before-or-after WRITEs to the same memory cell will be performed in some order, so there is no danger that the cell will end up containing, for example, the OR of several WRITE values. The database literature uses the words “isolation” and “serializable”, the operating system literature uses the words “mutual exclusion” and “critical section”, and the computer architecture literature uses the unqualified word “atomicity” for this concept. [Ch. 5] Compare with all-or-nothing atomicity and atomic. [Ch. 9]
Best-effort contract The promise given by a forwarding network when it accepts a packet: it will use its best effort to deliver the packet, but the time to delivery is not fixed, the order of delivery relative to other packets sent to the same destination is unpredictable, and the packet may be duplicated or lost. [Ch. 7]
Binding (n.); Bind (v.) As used in naming, a mapping from a specified name to a particular value in a specified context. When a binding exists, the name is said to be bound. Binding may occur at any time up to and including the instant that a name is resolved. The term is also used more generally, meaning to choose a specific lower-layer implementation for some higher-layer feature. [Ch. 2]
Bit error rate In a digital transmission system, the rate at which bits that have incorrect values arrive at the receiver, expressed as a fraction of the bits transmitted, for example, one in 10¹⁰. [Ch. 7]
Bit stuffing The technique of inserting a bit pattern as a marker in a stream of bits and then inserting bits elsewhere in the stream to ensure that payload data never matches the marker bit pattern. [Ch. 7]
Blind write An update to a data value X by a transaction that did not previously read X. [Ch. 9]
Bootstrapping A systematic approach to solving a general problem, consisting of a method for reducing the general problem to a specialized instance of the same problem and a method for solving the specialized instance. [Ch. 5]
Bottleneck The stage in a multistage pipeline that takes longer to perform its task than any of the other stages. [Ch. 6]
Broadcast To send a packet that is intended to be received by many (ideally, all) of the stations of a broadcast link (link-layer broadcast), or all the destination addresses of a network (network-layer broadcast). [Ch. 7]
Burst A batch of related bits that is irregular in size and timing relative to other such batches. Bursts of data are the usual content of messages and the usual payload of packets. One can also have bursts of noise and bursts of packets. [Ch. 7]
Byzantine fault A fault that generates inconsistent errors (perhaps maliciously) that can confuse or disrupt fault tolerance or security mechanisms. [Ch. 8]
Cache A performance-enhancing module that remembers the result of an expensive computation on the chance that the result may soon be needed again. [Ch. 2]
Cache coherence Read/write coherence for a multilevel memory system that has a cache. It is a specification that the cache provide strict consistency at its interface. [Ch. 10]
Capability In a computer system, an unforgeable ticket, which when presented is taken as incontestable proof that the presenter is authorized to have access to the object named in the ticket. [Ch. 11]
Capacity Any consistent measure of the size or amount of a resource. [Ch. 6]
Cell storage Storage in which a WRITE or PUT operates by overwriting, thus destroying previously stored information. Many physical storage devices, including magnetic disk and CMOS random access memory, implement cell storage. Compare with journal storage. [Ch. 9]
Certificate A message that attests the binding of a principal identifier to a cryptographic key. [Ch. 11]
Certificate authority (CA) A principal that issues and signs certificates. [Ch. 11]
Certify To check the accuracy, correctness, and completeness of a security mechanism. [Ch. 11]
Checkpoint 1. (n.) Information written to non-volatile storage that is intended to speed up recovery from a crash. 2. (v.) To write a checkpoint. [Ch. 9]
Checksum A stylized error-detection code in which the data is unchanged from its uncoded form and additional, redundant data is placed in a distinct, separately architected field. [Ch. 7]
Cipher Synonym for a cryptographic transformation. [Ch. 11]
Ciphertext The result of encryption. Compare with plaintext. [Ch. 11]
Circuit switch A device with many electrical circuits coming in to it that can connect any circuit to any other circuit; it may be able to perform many such connections simultaneously. Historically, telephone systems were constructed of circuit switches. [Ch. 7]
Cleartext Synonym for plaintext. [Ch. 11]
Client A module that initiates actions, such as sending a request to a service. [Ch. 4] At the end-to-end layer of a network, the end that initiates actions. Compare with service. [Ch. 7]
Client/service organization An organization that enforces modularity among modules of a computer system by limiting the interaction among the modules to messages. [Ch. 4]
Close-to-open consistency A consistency model for file operations. When a thread opens a file and performs several write operations, all of the modifications will be visible to concurrent threads only after the first thread closes the file. [Ch. 2]
Closure In a programming language, an object that consists of a reference to the text of a procedure and a reference to the context in which the program interpreter is to resolve the variables of the procedure. [Ch. 2]
Coherence See read/write coherence or cache coherence.
Collision 1. In naming, a particular kind of name conflict in which an algorithmic name generator accidentally generates the same name more than once in what is intended to be a unique identifier name space. [Ch. 3] 2. In networks, an event when two stations attempt to send a message over the same physical medium at the same time. See also Ethernet. [Ch. 7]
Commit To renounce the ability to abandon an all-or-nothing action unilaterally. One usually commits an all-or-nothing action before making its results available to concurrent or later all-or-nothing actions. Before committing, the all-or-nothing action can be abandoned and one can pretend that it had never been undertaken. After committing, the all-or-nothing action must be able to complete. A committed all-or-nothing action cannot be abandoned; if it can be determined precisely how far its results have propagated, it may be possible to reverse some or all of its effects by compensation. Commitment also usually includes an expectation that the results preserve any appropriate invariants and will be durable to the extent that the application requires those properties. Compare with compensate and abort. [Ch. 9]
Communication link A data communication path between physically separated components. [Ch. 2]
Compensate (adj.); Compensation (n.) To perform an action that reverses the effect of some previously committed action. Compensation is intrinsically application dependent; it is easier to reverse an incorrect accounting entry than it is to undrill an unwanted hole. [Ch. 9]
Complexity A loosely defined notion that a system has so many components, interconnections, and irregularities that it is difficult to understand, implement, and maintain. [Ch. 1]
Confidentiality Limiting information access to authorized principals. Secrecy is a synonym. [Ch. 11]
Confinement Allowing a potentially untrusted program to have access to data, while ensuring that the program cannot release information. [Ch. 11]
Congestion Overload of a resource that persists for significantly longer than the average service time of the resource. (Since significance is in the eye of the beholder, the concept is not a precise one.) [Ch. 7]
Congestion collapse When an increase in offered load causes a catastrophic decrease in useful work accomplished. [Ch. 7]
Connection A communication path that requires maintaining state between successive messages. See set up and tear down. [Ch. 7]
Connectionless Describes a communication path that does not require coordinated state and can be used without set up or tear down. See connection. [Ch. 7]
Consensus Agreement at separated sites on a data value despite communication failures. [Ch. 10]
Consistency A particular constraint on the memory model of a storage system that allows concurrency and uses replicas: that all readers see the same result. Also used in some professional literature as a synonym for coherence. [Ch. 10]
Constraint An application-defined invariant on a set of data values or externally visible actions. Example: a requirement that the balances of all the accounts of a bank sum to zero, or a requirement that a majority of the copies of a set of data be identical. [Ch. 10]
Context One of the inputs required by a name-mapping algorithm in order to resolve a name. A common form for a context is a set of name-to-value bindings. [Ch. 2]
Context reference The name of a context. [Ch. 2]
Continuous operation An availability goal, that a system be capable of running indefinitely. The primary requirement of continuous operation is that it must be possible to perform repair and maintenance without stopping the system. [Ch. 8]
Control point An entity that can adjust the capacity of a limited resource or change the load that a source offers. [Ch. 7]
Cooperative scheduling A style of thread scheduling in which each thread on its own initiative releases the processor periodically to allow other threads to run. [Ch. 5]
Covert channel In a flow-control security system, a way of leaking information into or out of a secure area. For example, a program with access to a secret might touch several shared but normally unused virtual memory pages in a pattern to bring them into real memory; a conspirator outside the secure area may be able to detect the pattern by measuring the time required to read those same shared pages. [Ch. 11]
Cryptographic hash function A cryptographic function that maps messages to short values in such a way that it is difficult to (1) reconstruct a message from its hash value; and (2) construct two different messages having the same value. [Ch. 11]
Cryptographic key The easily changeable component of a key-driven cryptographic transformation. A cryptographic key is a string of bits. The bits may be generated randomly, or they may be a transformed version of a password. The cryptographic key, or at least part of it, usually must be kept secret, while all other components of the transformation can be made public. [Ch. 11]
Cryptographic transformation Mathematical transformation used as a building block for implementing security primitives. Such building blocks include functions for implementing encryption and decryption, creating and verifying authentication tags, cryptographic hashes, and pseudorandom number generators. [Ch. 11]
Cryptography A discipline of theoretical computer science that specializes in the study of cryptographic transformations and protocols. [Ch. 11]
Cut-through A forwarding technique in which transmission of a packet or frame on an outgoing link begins while the packet or frame is still being received on the incoming link. [Ch. 7]
Dallying A technique to improve performance by delaying a request on the chance that the operation won’t be needed, or to create more opportunities for batching. [Ch. 6]
Dangling reference Use of a name that has outlived the binding of that name. [Ch. 3]
Data integrity Authenticity of the apparent content of a message or file. [Ch. 11] In a network, a transport protocol assurance that the data delivered to the recipient is identical to the original data the sender provided. Compare with origin authenticity. [Ch. 7]
Data rate The rate, usually measured in bits per second, at which bits are sent over a communication link. When talking of the data rate of an asynchronous communication link, the term is often used to mean the maximum data rate that the link allows. [Ch. 7]
Deadlock Undesirable interaction among a group of threads in which each thread is waiting for some other thread in the group to make progress. [Ch. 5]
Decay Unintended loss of stored state with the passage of time. [Ch. 2]
Decay set A set of storage blocks, words, tracks, or other physical groupings, in which all members of the set may spontaneously fail together, but independently of any other decay set. [Ch. 8]
Decrypt To perform a reverse cryptographic transformation on a previously encrypted message to obtain the plaintext. Compare with encrypt. [Ch. 11]
Default context reference A context reference chosen by the name resolver rather than specified as part of the name or by the object that used the name. Compare with explicit context reference. [Ch. 2]
Demand paging A class of page-movement algorithm that moves pages into the primary device only at the instant that they are used. Compare with prepaging. [Ch. 6]
Destination The network attachment point to which the payload of a packet is to be delivered. Sometimes used as shorthand for destination address. [Ch. 7]
Destination address An identifier of the destination of a packet, usually carried as a field in the header of the packet. [Ch. 7]
Detectable error An error or class of errors for which a reliable detection plan can be devised. An error that is not detectable usually leads to a failure, unless some mechanism that is intended to mask some other error accidentally happens to mask the undetectable error. Compare with maskable error and tolerated error. [Ch. 8]
Digital signature An authentication tag computed with public-key cryptography. [Ch. 11]
Directory In a file system, an object consisting of a table of bindings between symbolic file names and some description (e.g., a file number or a file map) of the corresponding file. Other terms used for this concept include catalog and folder. A directory is an example of a context. [Ch. 2]
Discretionary access control A property of an access control system. In a discretionary access control system, the owner of an object has the authority to decide which principals have access to that object. Compare with non-discretionary access control. [Ch. 11]
Do action (n.) Term used in some systems for a redo action. [Ch. 9]
Domain A range of addresses to which a thread has access. It is the abstraction that enforces modularity within a memory, separating modules and allowing for controlled sharing. [Ch. 5]
Down time A measure of the time that a system was not usable, as a fraction of the time that it was intended to be usable. Compare with its complement, availability. [Ch. 8]
Duplex Describes a link or connection between two stations that can be used in both directions. Compare with simplex, half-duplex, and full-duplex. [Ch. 7]
Duplicate suppression A transport protocol mechanism for achieving at-most-once delivery assurance, by identifying and discarding extra copies of packets or messages. [Ch. 7]
Durability A property of a storage medium that, once written, it can be read for as long as the application requires. Compare with stability and persistence, terms that have different technical definitions as explained in Sidebar 2.1. [Ch. 2]
Durable storage Storage with the property that it (ideally) is decay-free, so it never fails to return on a GET the data that was stored by a previously successful PUT. Since that ideal is impossibly strict, in practice, storage is considered durable when the probability of failure is sufficiently low that the application can tolerate it. Durability is thus an application-defined specification of how long the results of an action, once completed, must be preserved. Durable is distinct from non-volatile, which describes storage that maintains its memory while the power is off, but may still have an intolerable probability of decay. The term persistent is sometimes used as a synonym for durable, as explained in Sidebar 2.1, but to minimize confusion this text avoids that usage. [Ch. 8]
Dynamic scope An example of a default context, used to resolve names of program variables in some programming languages. The name resolver searches backward in the call stack for a binding, starting with the stack frame of the procedure that used the name, then the stack frame of its caller, then the caller’s caller, and so on. Compare with static scope. [Ch. 2]
Earliest deadline first scheduling policy A scheduling policy for real-time systems that gives priority to the thread with the earliest deadline. [Ch. 6]
Early drop A predictive strategy for managing an overloaded resource: the system refuses service to some customers before the queue is full. [Ch. 7]
Emergent property A property of an assemblage of components that would not be predicted by examining the components individually. Emergent properties are a surprise when first encountered. [Ch. 1]
Emulation Faithfully simulating some physical hardware so that the simulated hardware can run any software that the physical hardware can. [Ch. 5]
Encrypt To perform a cryptographic transformation on a message with the objective of achieving confidentiality. The cryptographic transformation is usually key-driven. Compare with the inverse operation, decrypt, which can recover the original message. [Ch. 11]
End-to-end Describes communication between network attachment points, as contrasted with communication between points within the network or across a single link. [Ch. 7]
End-to-end layer The communication system layer that manages end-to-end communications. [Ch. 7]
Enforced modularity Modularity that prevents accidental errors from propagating from one module to another. Compare with soft modularity. [Ch. 4]
Enumerate To generate a list of all the names that can currently be resolved (that is, that have bindings) in a particular context. [Ch. 2]
Environment 1. In a discussion of systems, everything surrounding a system that is not viewed as part of that system. The distinction between a system and its environment is a choice based on the purpose, ease of description, and minimization of interconnections. [Ch. 1] 2. In an interpreter, the state on which the interpreter should perform the actions directed by program instructions. [Ch. 2]
Environment reference The component of an interpreter that tells the interpreter where to find its environment. [Ch. 2]
Erasure An error in a string of bits, bytes, or groups of bits in which an identified bit, byte, or group of bits is missing or has indeterminate value. [Ch. 8]
Ergodic A property of some time-dependent probabilistic processes: that the (usually easier to measure) ensemble average of some parameter measured over a set of elements subject to the process is the same as the time average of that parameter of any single element of the ensemble. [Ch. 8]
Error Informally, a label for an incorrect data value or control signal caused by an active fault. If there is a complete formal specification for the internal design of a module, an error is a violation of some assertion or invariant of the specification. An error in a module is not identical to a failure of that module, but if an error is not masked, it may lead to a failure of the module. [Ch. 8]
Error containment Limiting how far the effects of an error propagate. A module is normally designed to contain errors in such a way that the effects of an error appear in a predictable way at the module’s interface. [Ch. 8]
Error correction A scheme to set to the correct value a data value or control signal that is in error. Compare with error detection. [Ch. 8]
Error-correction code A method of encoding stored or transmitted data with a modest amount of redundancy, in such a way that any errors during storage or transmission will, with high probability, lead to a decoding that is identical to the original data. See also the general definition of error correction. Compare with error-detection code. [Ch. 7]
Error detection A scheme to discover that a data value or control signal is in error. Compare with error correction. [Ch. 8]
Error-detection code A method of encoding stored or transmitted data with a small amount of redundancy, in such a way that any errors during storage or transmission will, with high probability, lead to a decoding that is obviously wrong. See also the general definition of error detection. Compare with error-correction code and checksum. [Ch. 7]
Ethernet A widely used broadcast network in which all participants share a common wire and can hear one another transmit. Ethernet is characterized by a transmit protocol in which a station wishing to send data first listens to ensure that no one else is sending, and then continues to monitor the network during its own transmission to see if some other station has tried to transmit at the same time, an error known as a collision. This protocol is named Carrier Sense Multiple Access with Collision Detection, abbreviated CSMA/CD. [Ch. 7]
Eventcount A special type of shared variable used for sequence coordination. It supports two primary operations: AWAIT and ADVANCE. An eventcount is a counter that is incremented atomically, using ADVANCE, while other threads wait for the counter to reach a certain value using AWAIT. Eventcounts are often used in combination with sequencers. [Ch. 5]
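As an illustration, here is a minimal sketch of an eventcount in Python (the class and method names are ours, not from the text; the AWAIT operation is named await_value because await is a reserved word in Python, and the companion sequencer is omitted):

```python
import threading

class Eventcount:
    """Sketch of an eventcount for sequence coordination."""

    def __init__(self):
        self._count = 0
        self._changed = threading.Condition()

    def advance(self):
        """ADVANCE: atomically increment the counter and wake any waiters."""
        with self._changed:
            self._count += 1
            self._changed.notify_all()

    def await_value(self, target):
        """AWAIT: block until the counter reaches the target value."""
        with self._changed:
            while self._count < target:
                self._changed.wait()
```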
Eventual consistency A requirement that at some unspecified time following an update to a collection of data, if there are no more updates, the memory model for that collection will hold. [Ch. 10]
Exactly-once A protocol assurance that the intended operation or message delivery was performed both at-least-once and at-most-once. [Ch. 4]
Exception An interrupt event that pertains to the thread that a processor is currently running. [Ch. 5]
Explicit context reference For a name or an object, an associated reference to the context in which that name, or all names contained in that object, are to be resolved. Compare with default context reference. [Ch. 2]
Explicitness A property of a message in a security protocol: if a message is explicit, then the message contains all the information necessary for a receiver to reliably determine that the message is part of a particular run of the protocol with a specific function and set of participants. [Ch. 11]
Exponential backoff An adaptive procedure used to set a timer, for example, to wait for congestion to dissipate. Each time the timer setting proves to be too small, the action doubles (or, more generally, multiplies by a constant greater than one) the length of its next timer setting. The intent is to obtain a suitable timer value as quickly as possible. See also exponential random backoff. [Ch. 7]
Exponential random backoff A form of exponential backoff in which an action that repeatedly encounters interference repeatedly doubles (or, more generally, multiplies by a constant greater than one) the size of an interval from which it randomly chooses its next delay before retrying. The intent is that by randomly changing the timing relative to other, interfering actions, the interference will not recur. [Ch. 9]
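A minimal Python sketch of this discipline (the function and parameter names are ours, chosen for illustration): the delay is drawn at random from an interval that doubles after each failed attempt.

```python
import random
import time

def retry_with_random_backoff(action, max_tries=8, initial_interval=0.01):
    """Retry 'action' (a function returning True on success), delaying a
    random amount within an interval that doubles after each failure."""
    interval = initial_interval
    for _ in range(max_tries):
        if action():
            return True
        time.sleep(random.uniform(0, interval))  # random point in the interval
        interval *= 2                            # double the interval each retry
    return False
```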
Export In naming, to provide a name for an object that other objects can use. [Ch. 2]
Fail-fast Describes a system or module design that contains detected errors by reporting at its interface that its output may be incorrect. Compare with fail-stop. [Ch. 8]
Fail-safe Describes a system design that detects incorrect data values or control signals and forces them to values that, even if not correct, are known to allow the system to continue operating safely. [Ch. 8]
Fail-secure Describes an application of fail-safe design to information protection: a failure is guaranteed not to allow unauthorized access to protected information. In early work on fault tolerance, this term was also occasionally used as a synonym for fail-fast. [Ch. 8]
Fail-soft Describes a design in which the system specification allows errors to be masked by degrading performance or disabling some functions in a predictable manner. [Ch. 8]
Fail-stop Describes a system or module design that contains detected errors by stopping the system or module as soon as possible. Compare with fail-fast, which does not require other modules to take additional action, such as setting a timer, to detect the failure. [Ch. 8]
Fail-vote Describes an N-modular redundancy system with a majority voter. [Ch. 8]
Failure The outcome when a component or system does not produce the intended result at its interface. Compare with fault. [Ch. 8]
Failure tolerance A measure of a system’s ability to mask active faults and continue operating correctly. A typical measure counts the number of contained components that can fail without causing the system to fail. [Ch. 8]
Fault A defect in materials, design, or implementation that may (or may not) cause an error and lead to a failure. Compare with failure. [Ch. 8]
Fault avoidance A strategy to design and implement a component with a probability of faults that is so low that it can be neglected. When applied to software, fault avoidance is sometimes called valid construction. [Ch. 8]
Fault tolerance A set of techniques that involve noticing active faults and lower-level subsystem failures and masking them, rather than allowing the resulting errors to propagate. [Ch. 8]
File A popular memory abstraction to durably store and retrieve data. A typical interface for a file consists of procedures to OPEN the file, to READ and WRITE regions of the file, and to CLOSE the file. [Ch. 2]
Fingerprint Another term for a witness. [Ch. 10]
First-come, first-served (FCFS) scheduling policy A scheduling policy in which requests are processed in the order in which they arrive. [Ch. 6]
First-in, first-out (FIFO) policy A particular page-removal policy for a multilevel memory system. FIFO chooses to remove the page that has been in the primary device the longest. [Ch. 6]
Flow control 1. In networks, an end-to-end protocol between a fast sender and a slow recipient, a mechanism that limits the sender’s data rate so that the recipient does not receive data faster than it can handle. [Ch. 7] 2. In security, a system that allows untrusted programs to work with sensitive data but confines all program outputs to prevent unauthorized disclosure. [Ch. 11]
Force (v.) When output may be buffered, to ensure that a previous output value has actually been written to durable storage or sent as a message. Caches that are not write-through usually have a feature that allows the invoker to force some or all of their contents to the secondary storage medium. [Ch. 9]
Forward error correction A technique for controlling errors in which enough redundancy to correct anticipated errors is applied before an error occurs. Forward error correction is particularly applicable when the original source of the data value or control signal will not be available to recalculate or resend it. Compare with backward error correction. [Ch. 8]
Forward secrecy A property of a security protocol. A protocol has forward secrecy if information, such as an encryption key, deduced from a previous transcript, doesn’t allow an adversary to decrypt future messages. [Ch. 11]
Forwarding table A table that tells the network layer which link to use to forward a packet, based on its destination address. [Ch. 7]
Fragment 1. (v.) In network protocols, to divide the payload of a packet so that it can fit into smaller packets for carriage across a link with a small maximum transmission unit. 2. (n.) The resulting pieces of payload. [Ch. 7]
Frame 1. (n.) The unit of transmission in the link layer. Compare with packet, segment, and message. 2. (v.) To delimit the beginning and end of a bit, byte, frame (n.), packet, segment, or message within a stream. [Ch. 7]
Freshness A property of a message in a security protocol: if the message is fresh, it is assured not to be a replay. [Ch. 11]
Full-duplex Describes a duplex link or connection between two stations that can be used in both directions at the same time. Compare with simplex, duplex, and half-duplex. [Ch. 7]
Gate A predefined protected entry point into a domain. [Ch. 5]
Generated name A name created algorithmically, rather than chosen by a person. [Ch. 3]
Global name In a layered naming scheme, a name that is bound only in the outermost context layer and thus has the same meaning to all users. [Ch. 2]
Half-duplex Describes a duplex link or connection between two stations that can be used in only one direction at a time. Compare with simplex, duplex, and full-duplex. [Ch. 7]
Hamming distance In an encoding system, the number of bits in an element of a code that would have to change to transform it into a different element of the code. The Hamming distance of a code is the minimum Hamming distance between any pair of elements of the code. [Ch. 8]
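For illustration, a small Python sketch (names ours) that computes both senses of the term: the distance between two equal-length codewords, and the Hamming distance of a whole code.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

def code_distance(codewords):
    """Hamming distance of a code: the minimum distance over all pairs."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

# This four-word code has Hamming distance 2: enough to detect, but not
# correct, any single-bit error.
print(code_distance(["0000", "0011", "0101", "0110"]))  # -> 2
```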
Hard real-time scheduling policy A real-time scheduler in which missing a deadline may result in a disaster. [Ch. 6]
Hash function A function that algorithmically derives a relatively short, fixed-length string of bits from an arbitrarily large block of data. The resulting short string is known as a hash. See also cryptographic hash function. [Ch. 3]
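For example, using Python's standard hashlib (SHA-256, which also happens to be a cryptographic hash function), an input of any size yields a fixed-length hash:

```python
import hashlib

data = b"an arbitrarily large block of data"
# The derived string is fixed-length (256 bits, printed as 64 hex digits)
# no matter how large the input is.
print(hashlib.sha256(data).hexdigest())
```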
Header Information that a protocol layer adds to the front of a packet. [Ch. 7]
Hierarchical routing A routing system that takes advantage of hierarchically assigned network destination addresses to reduce the size of its routing tables. [Ch. 7]
Hierarchy A technique of organizing systems that contain many components: group small numbers of components into self-contained and stable subsystems that then become components of larger self-contained and stable subsystems, and so on. [Ch. 1]
Hit ratio In a multilevel memory, the fraction of references satisfied by the primary memory device. [Ch. 6]
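The hit ratio is what makes a multilevel memory fast on average. With hit ratio h, primary-device latency L_p, and secondary-device latency L_s (symbols chosen here for illustration), the usual estimate of average latency is:

```latex
L_{avg} = h \, L_p + (1 - h) \, L_s
```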
Hop limit A network-layer protocol field that acts as a safety net to prevent packets from endlessly circulating in a network that has inconsistent forwarding tables. [Ch. 7]
Hot swap To replace modules in a system while the system continues to provide service. [Ch. 8]
Idempotent Describes an action that can be interrupted and restarted from the beginning any number of times and still produce the same result as if the action had run to completion without interruption. The essential feature of an idempotent action is that if there is any question about whether or not it completed, it is safe to do it again. “Idempotent” is correctly pronounced with the accent on the second syllable, not on the first and third. [Ch. 4]
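A toy Python contrast (our example, not from the text): assigning a balance is idempotent, while adding to it is not, so only the former is safe to redo when in doubt about completion.

```python
def set_balance(account, amount):
    account["balance"] = amount       # idempotent: repeating changes nothing

def add_to_balance(account, amount):
    account["balance"] += amount      # not idempotent: repeating compounds

acct = {"balance": 0}
set_balance(acct, 100)
set_balance(acct, 100)                # balance is 100 either way
add_to_balance(acct, 100)
add_to_balance(acct, 100)             # balance is now 300: the redo changed it
```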
Identifier A synonym for name, sometimes used to avoid an implication that the name might be meaningful to a person rather than to a machine. [Ch. 3]
Illegal instruction An instruction that an interpreter is not equipped to execute because it is not in the interpreter’s instruction repertoire or it has an out-of-range operand (for example, an attempt to divide by zero). An illegal instruction typically causes an interrupt. [Ch. 2]
Incommensurate scaling A property of most systems, that as the system grows (or shrinks) in size, not all parts grow (or shrink) at the same rate, thus stressing the system design. [Ch. 1]
Incremental backup A backup copy that contains only data that has changed since making the previous backup copy. [Ch. 10]
Indirect name A name that is bound to another name in the same name space. “Symbolic link”, “soft link”, and “shortcut” are other words used for this concept. Some operating systems also define the term alias to have this meaning rather than its more general meaning of synonym. [Ch. 2]
Indirection Decoupling a connection from one object to another by interposing a name, with the goal of delaying the choice of (or allowing a later change to) which object the name refers to. Indirection makes it possible to delay or change the choice of which object is used without changing the object that uses the name. Using a name is sometimes described as “inserting a level of indirection”. [Ch. 1]
Install In a system that uses logs to achieve all-or-nothing atomicity, to write data to cell storage. [Ch. 9]
Instruction reference A characteristic component of an interpreter: the place from which it will take its next instruction. [Ch. 2]
Intended load The amount of a shared resource that a set of users would attempt to utilize if the resource had unlimited capacity. In systems that have no provision for congestion control, the intended load is equal to the offered load. The goal of congestion control is to make the offered load smaller than the intended load. Compare with offered load. [Ch. 7]
Interleaving A technique to improve performance by distributing apparently sequential requests to several instances of a device, so that the requests may actually be processed concurrently. [Ch. 6]
Intermittent fault A persistent fault that is active only occasionally. Compare with transient fault. [Ch. 8]
International Organization for Standardization (ISO) An international non-governmental body that sets many technical and manufacturing standards, including the (frequently ignored) Open Systems Interconnect (OSI) reference model for data communication networks. The short name ISO is not an acronym; it is the Greek word for “equal”, chosen to be the same in all languages and always spelled in all capital letters. [Ch. 7]
Interpreter The abstraction that models the active mechanism performing computations. An interpreter comprises three components: an instruction reference, a context reference, and an instruction repertoire. [Ch. 2]
Interrupt An event that causes an interpreter to transfer control to the first instruction of a different procedure, an interrupt handler, instead of executing the next instruction. [Ch. 2]
Invalidate In a cache, to mark “do not use” or completely remove a cache entry because some event has occurred that may make the value associated with that entry incorrect. [Ch. 10]
Isochronous (From Greek roots meaning “equal” and “time”) Describes a communication link over which data is sent in frames whose length is fixed in advance and whose timing relative to other frames is precisely predictable. Compare with asynchronous. [Ch. 7]
Jitter In real-time applications, variability in the delivery times of successive data elements. [Ch. 7]
Job The unit of granularity on which threads are scheduled. A job corresponds to the burst of activity of a thread between two idle periods. [Ch. 6]
Journal storage Storage in which a WRITE or PUT appends a new value, rather than overwriting a previously stored value. Compare with cell storage. [Ch. 9]
Kernel A trusted intermediary that virtualizes resources for mutually distrustful modules running on the same computer. Kernel modules typically run with kernel mode enabled. [Ch. 5]
Kernel mode A feature of a processor that, when set, allows threads to use special processor features (e.g., the page-map address register) that are disallowed to threads that run with kernel mode disabled. Compare with user mode. [Ch. 5]
Key-based cryptographic transformation A cryptographic transformation for which successfully meeting the cryptographic goals depends on the secrecy of some component of the transformation. That component is called a cryptographic key, and a usual design is to make that key a small, modular, separable, and easily changeable component. [Ch. 11]
Key distribution center (KDC) A principal that authenticates other principals to one another and also provides one or more temporary cryptographic keys for communication between other principals. [Ch. 11]
Latency The delay between a change at the input to a system and the corresponding change at its output. [Ch. 2] As used in reliability, the time between when a fault becomes active and when the module in which the fault occurred either fails or detects the resulting error. [Ch. 8]
Latent fault A fault that is not currently causing an error. Compare with active fault. [Ch. 8]
Layering A technique of organizing systems in which the designer builds on an interface that is already complete (a lower layer) to create a different complete interface (an upper layer). [Ch. 1]
Least-recently-used (LRU) policy A popular page-removal policy for a multilevel memory system. LRU chooses to remove the page that has not been used the longest. [Ch. 6]
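A minimal Python sketch of the LRU bookkeeping for a primary device, using an ordered dictionary (the class and parameter names are ours, chosen for illustration):

```python
from collections import OrderedDict

class LRUPrimary:
    """Sketch of LRU page removal for a primary device of fixed capacity."""

    def __init__(self, capacity, read_from_secondary):
        self.capacity = capacity
        self.read_from_secondary = read_from_secondary
        self.pages = OrderedDict()                # least recently used first

    def reference(self, page_number):
        if page_number in self.pages:
            self.pages.move_to_end(page_number)   # hit: renew recency
        else:                                     # missing-page exception
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)    # remove the LRU page
            self.pages[page_number] = self.read_from_secondary(page_number)
        return self.pages[page_number]
```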
Lexical scope Another term for static scope. [Ch. 2]
Limited name space A name space in which a limited number of names can be expressed and therefore names must be allocated, deallocated, and reused. [Ch. 3]
Link 1. (n.) Another term for a synonym (usually called a hard link) or an indirect name (usually called a soft or symbolic link). 2. (v.) Another term for bind. [Ch. 2]. 3. (n.) In data communication, a communication path between two points. [Ch. 7]
Link layer The communication system layer that moves data directly from one physical point to another. [Ch. 7]
List system A design for an access control system in which each protected object is associated with a list of authorized principals. [Ch. 11]
Livelock An undesirable interaction among a group of threads in which each thread begins a sequence of actions, discovers that it cannot complete the sequence because actions of other threads have interfered, and begins again, endlessly. [Ch. 5]
Locality of reference A property of most programs that memory references tend to be clustered in both time and address space. [Ch. 6]
Lock A flag associated with a data object, set by a thread to warn concurrent threads that the object is in use and that it may be a mistake for other threads to read or write it. Locks are one technique used to achieve before-or-after atomicity. [Ch. 5]
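For example, in Python (illustrative names, not from the text), a lock turns the read-modify-write of a shared variable into a before-or-after action with respect to other threads that acquire the same lock:

```python
import threading

balance = 0
balance_lock = threading.Lock()

def deposit(amount):
    global balance
    with balance_lock:               # the read and the write below execute as
        balance = balance + amount   # a before-or-after action relative to
                                     # other threads honoring the same lock
```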
Lock point In a system that provides before-or-after atomicity by locking, the first instant in a before-or-after action when every lock that will ever be in its lock set has been acquired. [Ch. 9]
Lock set The collection of all locks acquired during the execution of a before-or-after action. [Ch. 9]
Lock-step protocol In networking, any transport protocol that requires acknowledgment of the previously sent message, segment, packet, or frame before sending another message, segment, packet, or frame to the same destination. Sometimes called a stop and wait protocol. Compare with pipeline. [Ch. 7]
Log 1. (n.) A specialized use of journal storage to maintain an append-only record of some application activity. Logs are used to implement all-or-nothing actions, for performance enhancement, for archiving, and for reconciliation. 2. (v.) To append a record to a log. [Ch. 9]
Logical copy A replica that is organized in a form determined by a higher layer. An example is a replica of a file system that is made by copying one file at a time. Analogous to logical locking. Compare with physical copy. [Ch. 10]
Logical locking Locking of higher-layer data objects such as records or fields of a database. Compare with physical locking. [Ch. 9]
Manchester code A particular type of phase encoding in which each bit is represented by two bits of opposite value. [Ch. 7]
Margin The amount by which a specification is better than necessary for correct operation. The purpose of designing with margins is to mask some errors. [Ch. 8]
Mark point 1. (adj.) An atomicity-assuring discipline in which each newly created action n must wait to begin reading shared data objects until action (n − 1) has marked all of the variables it intends to modify. 2. (n.) The instant at which an action has marked all of the variables it intends to modify. [Ch. 9]
Marshal/unmarshal To marshal is to transform the internal representation of one or more pieces of data into a form that is more suitable for transmission or storage. The opposite action, to unmarshal, is to parse marshaled data into its constituent data pieces and transform those pieces into a suitable internal representation. [Ch. 4]
Maskable error An error or class of errors that is detectable and for which a systematic recovery strategy can in principle be devised. Compare with detectable error and tolerated error. [Ch. 8]
Masking As used in reliability, containing an error within a module in such a way that the module meets its specifications as if the error had not occurred. [Ch. 8]
Master In a multiple-site replication scheme, the site to which updates are directed. Compare with slave. [Ch. 10]
Maximum transmission unit (MTU) A limit on the size of a packet, imposed to control the time commitment involved in transmitting the packet, to control the amount of loss if congestion causes the packet to be discarded, and to keep low the probability of a transmission error. [Ch. 7]
Mean time between failures (MTBF) The sum of MTTF and MTTR for the same component or system. [Ch. 8]
Mean time to failure (MTTF) The expected time that a component or system will operate continuously without failing. “Time” is sometimes measured in cycles of operation. [Ch. 8]
Mean time to repair (MTTR) The expected time to replace or repair a component or system that has failed. The term is sometimes written as “mean time to restore service”, but it is still abbreviated MTTR. [Ch. 8]
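These three measures are related by definition, and together they yield the standard expression for availability:

```latex
\text{MTBF} = \text{MTTF} + \text{MTTR},
\qquad
\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
                    = \frac{\text{MTTF}}{\text{MTBF}}
```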
Mediation Before a service performs a requested operation, determining which principal is associated with the request and whether the principal is authorized to request the operation. [Ch. 11]
Memory The abstraction for remembering data values, using READ and WRITE operations. The WRITE operation specifies a value to be remembered and a name by which that value can be recalled in the future. See also storage. [Ch. 2]
Memoryless A property of some time-dependent probabilistic processes, that the probability of what happens next does not depend on what has happened before. [Ch. 8]
Memory manager A device located between a processor and memory that translates virtual to physical addresses and checks that memory references by the thread running on the processor are in the thread’s domain(s). [Ch. 5]
Memory-mapped I/O An interface that allows an interpreter to communicate with an I/O module using LOAD and STORE instructions that have ordinary memory addresses. [Ch. 2]
Message The unit of communication at the application level. The length of a message is determined by the application that sends it. Since a network may have a maximum size for its unit of transmission, the end-to-end layer divides a message into one or more segments, each of which is carried in a separate packet. Compare with frame (n.), segment, and packet. [Ch. 7]
Message authentication The verification of the integrity of the origin and the data of a message. [Ch. 11]
Message authentication code (MAC) An authentication tag computed with shared-secret cryptography. MAC is sometimes used as a verb in security jargon, as in “Just to be careful, let’s MAC the address field of that message.” [Ch. 11]
Metadata Information about an object that is not part of the object itself. Examples are the name of the object, the identity of its owner, the date it was last modified, and the location in which it is stored. [Ch. 3]
Microkernel A kernel organization in which most operating system components run in separate, user-mode address spaces. [Ch. 5]
Mirror (n.) One of a set of replicas that is created or updated synchronously. Compare with primary copy and backup copy. Sometimes used as a verb, as in “Let’s mirror that data by making three replicas.” [Ch. 8]
Missing-page exception The event when an addressed page is not present in the primary device and the virtual memory manager has to move the page in from a secondary device. The literature also uses the term page fault. [Ch. 6]
Modular sharing Sharing of an object without the need to know details of the implementation of the shared object. With respect to naming, modular sharing is sharing without the need to know the names that the shared object uses to refer to its components. [Ch. 3]
Module A system component that can be separately designed, implemented, managed, and replaced. [Ch. 1]
Monolithic kernel A kernel organization in which most operating system procedures run in a single, kernel-mode address space. [Ch. 5]
Most-recently-used (MRU) policy A page-removal policy for a multilevel memory system. MRU chooses for removal the most recently used page in the primary device. [Ch. 6]
MTU discovery A procedure that systematically discovers the smallest maximum transmission unit along the path between two network attachment points. [Ch. 7]
Multihomed Describes a single physical interface between the network layer and the end-to-end layer that is associated with more than one network attachment point, each with its own network-layer address. [Ch. 7]
Multilevel memory Memory built out of two or more different memory devices that have significantly different latencies and cost per bit. [Ch. 6]
Multiple lookup A name-mapping algorithm that tries several contexts in sequence, looking for the first one that can successfully resolve a presented name. [Ch. 2]
Multiplexing Sharing a communication link among several, usually independent, simultaneous communications. The term is also used in layered protocol design when several different higher-layer protocols share the same lower-layer protocol. [Ch. 7]
Multipoint Describes communication that involves more than two parties. A multipoint link is a single physical medium that connects several parties. A multipoint protocol coordinates the activities of three or more participants. [Ch. 7]
N + 1 redundancy When a load can be handled by sharing it among N equivalent modules, the technique of installing N + 1 or more of the modules, so that if one fails the remaining modules can continue to handle the full load while the one that failed is being repaired. [Ch. 8]
N-modular redundancy (NMR) A redundancy technique that involves supplying identical inputs to N equivalent modules and connecting the outputs to one or more voters. [Ch. 8]
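A sketch of the voter in Python (the function name is ours). With N = 3, i.e., triple-modular redundancy, it masks the output of one faulty replica:

```python
from collections import Counter

def majority_vote(outputs):
    """Return the majority output of N replicas; fail if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("voter: no majority among replica outputs")

print(majority_vote([7, 7, 9]))   # the faulty third replica is masked -> 7
```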
N-version programming The software version of N-modular redundancy. N different teams each independently write a program from its specifications. The programs then run in parallel, and voters compare their outputs. [Ch. 8]
Name A designator or an identifier of an object or value. A name is an element of a name space. [Ch. 2]
Name conflict An occurrence when, for some reason, it seems necessary to bind the same name to two different values at the same time in the same context. Usually, a result of encountering a preexisting name in a naming scheme that does not provide modular sharing. When names are algorithmically generated, name conflicts are called collisions. [Ch. 3]
Name-mapping algorithm See naming scheme. [Ch. 2]
Name space The set of all possible names of a particular naming scheme. A name space is defined by a set of symbols from some alphabet together with a set of syntax rules that define which names are members of the name space. [Ch. 2]
Name-to-key binding A binding between a principal identifier and a cryptographic key. [Ch. 11]
Naming hierarchy A naming network that is constrained to a tree-structured form. The root used for interpretation of absolute path names (which in a naming hierarchy are sometimes called “tree names”) is normally the base of the tree. [Ch. 2]
Naming network A naming scheme in which contexts are named objects and any context may contain a binding for any other context, as well as for any non-context object. An object in a naming network is identified by a multicomponent path name that traces a path through the naming network from some starting point, which may be either a default context or a root. [Ch. 2]
Naming scheme A particular combination of a name space, a universe of values (which may include physical objects) that can be named, and a name-mapping algorithm that provides a partial mapping from the name space to the universe of values. [Ch. 2]
Negative acknowledgment (NAK or NACK) A status report from a recipient to a sender asserting that some previous communication was not received or was received incorrectly. The usual reason for sending a negative acknowledgment is to avoid the delay that would be incurred by waiting for a timer to expire. Compare with acknowledgment. [Ch. 7]
Network A communication system that interconnects more than two things. [Ch. 7]
Network address In a network, the identifier of the source or destination of a packet. [Ch. 7]
Network attachment point The place at which the network layer accepts or delivers payload data to and from the end-to-end layer. Each network attachment point has an identifier, its address, that is unique within that network. A network attachment point is sometimes called an access point, and in ISO terminology, a Network Services Access Point (NSAP). [Ch. 7]
Network layer The communication system layer that forwards data through intermediate links to carry it to its intended destination. [Ch. 7]
Non-discretionary access control A property of an access control system. In a non-discretionary access control system, some principal other than the owner has the authority to decide which principals may access the object. Compare with discretionary access control. [Ch. 11]
Non-preemptive scheduling A scheduling policy in which threads run until they explicitly yield or wait. [Ch. 5]
Non-volatile memory A kind of memory that does not require a continuous source of power, so it retains its content when its power supply is off. The phrase “stable storage” is a common synonym. Compare with volatile memory. [Ch. 2]
Nonce A unique identifier that should never be reused. [Ch. 7]
Object As used in naming, any software or hardware structure that can have a distinct name. [Ch. 2]
Offered load The amount of a shared service that a set of users attempt to utilize. Presented load is an occasionally encountered synonym. [Ch. 6]
Opaque name In a modular system, a name that, from the point of view of the current module, carries no overloading that the module knows how to interpret. [Ch. 3]
Operating system A collection of programs that provide services such as abstraction and management of hardware devices and features such as libraries of commonly needed procedures, all of which are intended to make it easier to write application programs. [Ch. 2]
Optimal (OPT) page-removal policy An unrealizable page-removal policy for a multilevel memory system. The optimal policy removes from primary memory the page that will not be used for the longest time. Because identifying that page requires knowing the future, the optimal policy is not implementable in practice. Its utility is that after any particular reference string has been observed, one can then simulate the operation of that reference string with the optimal policy, to compare the number of missing-page exceptions with the number obtained when using other, realizable policies. [Ch. 6]
Optimistic concurrency control A concurrency control scheme that allows concurrent threads to proceed even though a risk exists that they will interfere with each other, with the plan of detecting whether there actually is interference and, if necessary, forcing one of the threads to abort and retry. Optimistic concurrency control is an effective technique in situations where interference is possible but not likely. Compare with pessimistic concurrency control. [Ch. 9]
Origin authenticity Authenticity of the claimed origin of a message. Compare with data integrity. [Ch. 11]
Overload When offered load exceeds the capacity of a service for a specified period of time. [Ch. 6]
Overloaded name A name that does more than simply identify an object; it also carries other information, such as the type of the object, the date it was modified, or how to locate it. Overloading is commonly encountered when a system has not made suitable provision to handle metadata. Contrast with pure name. [Ch. 3]
Packet The unit of transmission of the network layer. A packet consists of a segment of payload data, accompanied by guidance information that allows the network to forward it to the network attachment point that is intended to receive the data carried in the packet. Compare with frame (n.), segment, and message. [Ch. 7]
Packet forwarding In the network layer, upon receiving a packet that is not destined for the local end-to-end layer, to send it out again along some link with the intention of moving the packet closer to its destination. [Ch. 7]
Packet switch A specialized computer that forwards packets in a data communication network. Sometimes called a packet forwarder or, if it also implements an adaptive routing algorithm, a router. [Ch. 7]
Page In a page-based virtual memory system, the unit of translation between virtual addresses and physical addresses. [Ch. 5]
Page fault See missing-page exception.
Page map Data structure employed by the virtual memory manager to map virtual addresses to physical addresses. [Ch. 5]
Page-map address register A processor register maintained by the thread manager. It contains a pointer to the page map used by the currently active thread, and it can be changed only when the processor is in kernel mode. [Ch. 5]
Page-removal policy A policy for deciding which page to move from the primary to the secondary device to make room to bring in a new page. [Ch. 6]
Page table A particular form of a page map, in which the map is organized as an array indexed by page number. [Ch. 5]
Pair-and-compare A method for constructing fail-fast modules from modules that do not have that property, by connecting the inputs of two replicas of the module together and connecting their outputs to a comparator. When one repairs a failed pair-and-compare module by replacing the entire two-replica module with a spare, rather than identifying and replacing the replica that failed, the method is called pair-and-spare. [Ch. 8]
Pair-and-spare See pair-and-compare.
Parallel transmission A scheme for increasing the data rate between two modules by sending data over several parallel lines that are coordinated by the same clock. [Ch. 7]
Partition To divide a job up and assign it to different physical devices, with the intent that a failure of one device does not prevent the entire job from being done. [Ch. 8]
Password A secret character string used to authenticate the claimed identity of an individual. [Ch. 11]
Path name A name with internal structure that traces a path through a naming network. Any prefix of a path name can be thought of as the explicit context reference to use for resolution of the remainder of the path name. See also absolute path name and relative path name. [Ch. 2]
Path selection In a network-layer routing protocol, when a participant updates its own routing information with new information learned from an exchange with its neighbors. [Ch. 7]
Payload In a layered description of a communication system, the data that a higher layer has asked a lower layer to send; used to distinguish that data from the headers and trailers that the lower layer adds. (This term seems to have been borrowed from the transportation industry, where it is used frequently in aerospace applications.) [Ch. 7]
Pending A state of an all-or-nothing action, when that action has not yet either committed or aborted. Also used to describe the value of a variable that was set or changed by a still-pending all-or-nothing action. [Ch. 9]
Persistence A property of an active agent such as an interpreter that, when it detects it has failed, it keeps trying until it succeeds. Compare with stability and durability, terms that have different technical definitions as explained in Sidebar 2.1. The adjective “persistent” is used in some contexts as a synonym for stable and sometimes also in the sense of immutable. [Ch. 2]
Persistent fault A fault that cannot be masked by retry. Compare with transient fault and intermittent fault. [Ch. 8]
Persistent sender A transport protocol participant that, by sending the same message repeatedly, tries to ensure that at least one copy of the message gets delivered. [Ch. 7]
Pessimistic concurrency control A concurrency control scheme that forces a thread to wait if there is any chance that by proceeding it may interfere with another, concurrent, thread. Pessimistic concurrency control is an effective technique in situations where interference between concurrent threads has a high probability. Compare with optimistic concurrency control. [Ch. 9]
Phase encoding A method of encoding data for digital transmission in which at least one level transition is associated with each transmitted bit, to simplify framing and recovery of the sender’s clock. [Ch. 7]
Physical address An address that is translated geometrically to read or write data stored on a device. Compare with virtual address. [Ch. 5]
Physical copy A replica that is organized in a form determined by a lower layer. An example is a replica of a disk that is made by copying it sector by sector. Analogous to physical locking. Compare with logical copy. [Ch. 10]
Physical locking Locking of lower-layer data objects, typically chunks of data whose extent is determined by the physical layout of a storage medium. Examples of such chunks are disk sectors or even an entire disk. Compare with logical locking. [Ch. 9]
Piggybacking In an end-to-end protocol, a technique for reducing the number of packets sent back and forth by including acknowledgments and other protocol state information in the header of the next packet that goes to the other end. [Ch. 7]
Pipeline In networking, a transport protocol design that allows sending a packet before receiving an acknowledgment of the packet previously sent to the same destination. Contrast with lock-step protocol. [Ch. 7]
Plaintext The result of decryption. Also sometimes used to describe data that has not been encrypted, as in “The mistake was sending that message as plaintext.” Compare with ciphertext. [Ch. 11]
Point-to-point Describes a communication link between two stations, as contrasted with a broadcast or multipoint link. [Ch. 7]
Polling A style of interaction between threads or between a processor and a device in which one periodically checks whether the other needs attention. [Ch. 5]
Port In an end-to-end transport protocol, the multiplexing identifier that tells which of several end-to-end applications or application instances should receive the payload. [Ch. 7]
Preemptive scheduling A scheduling policy in which a thread manager can interrupt and reschedule a running thread at any time. [Ch. 5]
Prepaging An optimization for a multilevel memory manager in which the manager predicts which pages might be needed and brings them into the primary memory before the application demands them. Compare with demand algorithm. [Ch. 6]
Prepared In a layered or multiple-site all-or-nothing action, a state of a component action that has announced that it can, on command, either commit or abort. Having reached this state, it awaits a decision from the higher-layer coordinator of the action. [Ch. 9]
Presentation protocol A protocol that translates semantics and data of the network to match those of the local programming environment. [Ch. 7]
Presented load See offered load.
Preventive maintenance Active intervention intended to increase the mean time to failure of a module or system and thus improve its reliability and availability. [Ch. 8]
Primary copy Of a set of replicas that are not written or updated synchronously, the one that is considered authoritative and, usually, written or updated first. Compare with mirror and backup copy. [Ch. 10]
Primary device In a multilevel memory system, the memory device that is faster and usually more expensive and thus smaller. Compare with secondary device. [Ch. 6]
Principal The representation inside a computer system of an agent (a person, a computer, a thread) that makes requests to the security system. A principal is the entity in a computer system to which authorizations are granted; thus, it is the unit of accountability and responsibility in a computer system. [Ch. 11]
Priority scheduling policy A scheduling policy in which some jobs have priority over other jobs. [Ch. 6]
Privacy A socially defined ability of an individual (or organization) to determine if, when, and to whom personal (or organizational) information is to be released and also what limitations should apply to use of released information. [Ch. 11]
Private key In public-key cryptography, the cryptographic key that must be kept secret. Compare with public key. [Ch. 11]
Processing delay In a communication network, that component of the overall delay contributed by computation that takes place in various protocol layers. [Ch. 7]
Program counter A processor register that holds the reference to the current or next instruction that the processor is to execute. [Ch. 2]
Progress A desirable guarantee provided by an atomicity-assuring mechanism: that, despite potential interference from concurrency, some useful work will be done. An example of such a guarantee is that the atomicity-assuring mechanism will not abort at least one member of the set of concurrent actions. In practice, lack of a progress guarantee can sometimes be repaired by using exponential random backoff. In formal analysis of systems, progress is one component of a property known as “liveness”. Progress is an assurance that the system will move toward some specified goal, whereas liveness is an assurance that the system will eventually reach that goal. [Ch. 9]
Propagation delay In a communication network, the component of overall delay contributed by the velocity of propagation of the physical medium used for communication. [Ch. 7]
Propagation of effects A property of most systems: a change in one part of the system causes effects in areas of the system that are far removed from the changed part. A good system design tends to minimize propagation of effects. [Ch. 1]
Protection 1. Synonym for security. 2. Sometimes used in a narrower sense to denote mechanisms and techniques that control the access of executing programs to information. [Ch. 11]
Protection group A principal that is shared by more than one user. [Ch. 11]
Protocol An agreement between two communicating parties, for example, on the messages and the format of data that they intend to exchange. [Ch. 7]
Public key In public-key cryptography, the key that can be published (i.e., the one that doesn’t have to be kept secret). Compare with private key. [Ch. 11]
Public-key cryptography A key-based cryptographic transformation that can provide both confidentiality and authenticity of messages without the need to share a secret between sender and recipient. Public-key systems use two cryptographic keys, one of which must be kept secret, but does not need to be shared. [Ch. 11]
Publish/subscribe A communication style using a trusted intermediary. Clients push or pull messages to or from an intermediary. The intermediary determines who actually receives a message and if a message should be fanned out to multiple recipients. [Ch. 4]
Pure name A name that is not overloaded in any way. The only operations that apply to a pure name are COMPARE, RESOLVE, BIND, and UNBIND. Contrast with overloaded name. [Ch. 3]
Purging A technique used in some N-modular redundancy designs, in which the voter ignores the output of any replica that, at some time in the past, disagreed with several others. [Ch. 8]
Qualified name A name that includes an explicit context reference. [Ch. 2]
Quench (n.) An administrative message sent by a packet forwarder to another forwarder or to an end-to-end-layer sender asking that the forwarder or sender stop sending data or reduce its rate of sending data. [Ch. 7]
Queuing delay In a communication network, the component of overall delay that is caused by waiting for a resource such as a link to become available. [Ch. 7]
Quorum A partial set of replicas intended to improve availability. One defines a read quorum and a write quorum that intersect, with the goal that for correctness it is sufficient to read from a read quorum and write to a write quorum. [Ch. 10]
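A worked illustration of the arithmetic behind quorums (my example, not from the text): with n replicas, a read quorum of size r and a write quorum of size w intersect whenever r + w > n, so every read overlaps the most recent write.

    # Hypothetical sizes for n = 5 replicas.
    n, r, w = 5, 3, 3

    # Intersection requirement: any read quorum and any write quorum
    # must share at least one replica.
    assert r + w > n          # 3 + 3 > 5

    # Guaranteed minimum overlap between the two quorums.
    print(r + w - n)          # 1 replica in common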
Race condition A timing-dependent error in thread coordination that may result in threads computing incorrect results (for example, multiple threads simultaneously try to update a shared variable that they should have updated one at a time). [Ch. 5]
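The example in this definition can be made concrete with a short sketch (illustrative only): several threads perform a non-atomic read-modify-write on a shared variable, and updates may be lost.

    import threading

    counter = 0

    def worker():
        global counter
        for _ in range(100_000):
            counter += 1      # load, add, store: not atomic, so updates can interleave

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()

    print(counter)            # often less than 400000 without coordination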
RAID An acronym for Redundant Array of Independent (or Inexpensive) Disks, a set of techniques that use a controller and multiple disk drives configured to improve some combination of storage performance or durability. A RAID system usually has an interface that is electrically and programmatically identical to a single disk, thus allowing it to transparently replace a single disk. [Ch. 2]
Random access memory A memory device for which the latency for memory cells chosen at random is approximately the same as the latency obtained by choosing cells in the pattern best suited for that memory device. [Ch. 2]
Random drop A strategy for managing an overloaded resource: the system refuses service to a queue member chosen at random. [Ch. 7]
Random early detection (RED) A combination of random drop and early drop. [Ch. 7]
Rate monotonic scheduling policy A policy that schedules periodic jobs for a real-time system. Each job receives in advance a priority that is proportional to the frequency of the occurrence of that job. The scheduler always runs the highest priority job, preempting a running job, if necessary. [Ch. 6]
Read and set memory (RSM) A hardware or software function used primarily for implementing locks. RSM loads a value from a memory location into a register and stores another value in the same memory location. The important property of RSM is that no other loads and stores by concurrent threads can come between the load and the store of an RSM. RSM is nearly always implemented as a hardware instruction. [Ch. 5]
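A sketch of how a lock can be built on such an instruction (this is a simulation: the guard lock below stands in for the hardware's atomicity, and the names are mine):

    import threading

    _hw = threading.Lock()    # stands in for hardware atomicity in this sketch

    def rsm(memory, addr, new_value):
        # Atomically load the old value and store the new one.
        with _hw:
            old = memory[addr]
            memory[addr] = new_value
            return old

    UNLOCKED, LOCKED = 0, 1
    lock = {"state": UNLOCKED}

    def acquire(lock):
        # Spin until RSM returns UNLOCKED, which means this caller
        # is the one that changed the state to LOCKED.
        while rsm(lock, "state", LOCKED) == LOCKED:
            pass

    def release(lock):
        lock["state"] = UNLOCKED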
Read/write coherence A property of a memory, that a READ always returns the result of the most recent WRITE. [Ch. 2]
Ready/acknowledge protocol A data transmission protocol in which each transmission is framed by a ready signal from the sender and an acknowledge signal from the receiver. [Ch. 7]
Real time 1. (adj.) Describes a system that requires delivery of results before some deadline. 2. (n.) The wall-clock sequence that an all-seeing observer would associate with a series of actions. [Ch. 6]
Real-time scheduling policy A scheduler that attempts to schedule jobs in such a way that all jobs complete before their deadlines. [Ch. 6]
Reassembly Reconstructing a message by arranging, in correct order, the segments it was divided into for transmission. [Ch. 7]
Reconciliation A procedure that compares replicas that are intended to be identical and repairs any differences. [Ch. 10]
Recursive name resolution A method of resolving path names. The least significant component of the path name is looked up in the context named by the remainder of the path name, which must thus be resolved first. [Ch. 2]
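A minimal sketch of the method, assuming contexts are represented as nested dictionaries (my representation, not the book's):

    def resolve(path, root):
        # For "a/b/c": look up the least significant component ("c") in the
        # context named by the remainder ("a/b"), which is resolved first.
        if "/" not in path:
            return root[path]
        remainder, last = path.rsplit("/", 1)
        context = resolve(remainder, root)   # resolve the remainder first
        return context[last]

    contexts = {"a": {"b": {"c": "value"}}}
    print(resolve("a/b/c", contexts))        # value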
Redo action An application-specified action that, when executed during failure recovery, produces the effect of some committed component action whose effect may have been lost in the failure. (Some systems call this a “do action”. Compare with undo action.) [Ch. 9]
Redundancy Extra information added to detect or correct errors in data or control signals. [Ch. 8]
Reference (n.) Use of a name by an object to refer to another object. In grammatical English, the corresponding verb is “to refer to”. In computer jargon, the non-standard verb “to reference” appears frequently, and the coined verb “dereference” is a synonym for resolve. [Ch. 2]
Reference string The string of addresses issued by a thread during its execution (typically, the string of the virtual addresses issued by a thread’s execution of LOAD and STORE instructions; it may also include the addresses of the instructions themselves). [Ch. 6]
Relative path name A path name that the name resolver resolves in a default context provided by the environment. [Ch. 2]
Reliability A statistical measure, the probability that a system is still operating at time t, given that it was operating at some earlier time t₀. [Ch. 8]
Reliable delivery A transport protocol assurance: it provides both at-least-once delivery and data integrity. [Ch. 7]
Remote procedure call (RPC) A stylized form of client/service interaction in which each request is followed by a response. Usually, remote procedure call systems also provide marshaling and unmarshaling of the request and the response data. The word “procedure” in “remote procedure call” is misleading, since RPC semantics are different from those of an ordinary procedure call: for example, RPC specifically allows for clients and the service to fail independently. [Ch. 4]
Repair An active intervention to fix or replace a module that has been identified as failing, preferably before the system of which it is a part fails. [Ch. 8]
Repertoire The set of operations or actions an interpreter is prepared to perform. The repertoire of a general-purpose processor is its instruction set. [Ch. 2]
Replica 1. One of several identical modules that, when presented with the same inputs, are expected to produce the same output. 2. One of several identical copies of a set of data. [Ch. 8]
Replicated state machine A method of performing an update to a set of replicas that involves sending the update request to each replica and performing it independently at each replica. [Ch. 10]
Replication The technique of using multiple replicas to achieve fault tolerance. [Ch. 8]
Repudiate To disown an apparently authenticated message. [Ch. 11]
Request The message sent from a client to a service. [Ch. 4]
Resolve To perform a name-mapping algorithm from a name to the corresponding value. [Ch. 2]
Response The message sent from a service to a client in response to a previous request. [Ch. 4]
Roll-forward recovery A write-ahead log protocol with the additional requirement that the application log its outcome record before it performs any install actions. If there is a failure before the all-or-nothing action passes its commit point, the recovery procedure does not need to undo anything; if there is a failure after commit, the recovery procedure can use the log record to ensure that cell storage installs are not lost. Also known as redo logging. Compare with rollback recovery. [Ch. 9]
Rollback recovery Also known as undo logging. A write-ahead log protocol with the additional requirement that the application perform all install actions before logging an outcome record. If there is a failure before the all-or-nothing action commits, a recovery procedure can use the log record to undo the partially completed all-or-nothing action. Compare with roll-forward recovery. [Ch. 9]
Root The context used for the interpretation of absolute path names. The name for the root is usually bound to a constant value (typically, a well-known name of a lower layer), and that binding is normally built in to the name resolver at design time. [Ch. 2]
Round-robin scheduling A preemptive scheduling policy in which a thread runs for some maximum time before the next one is scheduled. When all threads have run, the scheduler starts again with the first thread. [Ch. 6]
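A toy version of the policy (illustrative; real schedulers preempt on a clock interrupt rather than simulating time):

    from collections import deque

    def round_robin(jobs, quantum):
        """jobs: list of (name, remaining_time); returns completion order."""
        ready = deque(jobs)
        done = []
        while ready:
            name, remaining = ready.popleft()
            remaining -= quantum                 # run for at most one quantum
            if remaining > 0:
                ready.append((name, remaining))  # back of the queue
            else:
                done.append(name)
        return done

    print(round_robin([("A", 3), ("B", 1), ("C", 2)], quantum=1))  # ['B', 'C', 'A']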
Round-trip time In a network, the time between sending a packet and receiving the corresponding response or acknowledgment. Round-trip time comprises two (possibly different) network transit times and the time required for the correspondent to process the packet and prepare a response. [Ch. 7]
Router A packet forwarder that also participates in a routing algorithm. [Ch. 7]
Routing algorithm An algorithm intended to construct consistent, efficient forwarding tables. A routing algorithm can be either centralized, which means that one node calculates the forwarding tables for the entire network, or decentralized, which means that many participants perform the algorithm concurrently. [Ch. 7]
Scheduler The part of the thread manager that implements the policy for deciding which thread to run. Policies can be preemptive or non-preemptive. [Ch. 5]
Scope In a layered naming scheme, the set of contexts in which a particular name is bound to the same value. [Ch. 2]
Search As used in naming, a synonym for multiple lookup. This usage of the term is a highly constrained form of the more general definition of search as used in information retrieval and full-text search systems: to locate all instances of records that match a given query. [Ch. 2]
Search path A default context reference that consists of the identifiers of the contexts to be used in a multiple lookup name resolution. The word “path” as used here has no connection with its use in path name, and the word “search” has only a distant connection with the concept of key word search. [Ch. 2]
Secondary device In a multilevel memory system, the memory device that is larger but also usually slower. Compare with primary device. [Ch. 6]
Secrecy Synonym for confidentiality. [Ch. 11]
Secure area A physical space or a virtual address space in which confidential information can be safely confined. [Ch. 11]
Secure channel A communication channel that can safely send information from one secure area to another. The channel may provide confidentiality or authenticity or, more commonly, both. [Ch. 11]
Security The protection of information and information systems against unauthorized access or modification of information, whether in storage, processing, or transit, and against denial of service to authorized users. [Ch. 11]
Security protocol A message protocol designed to achieve some security objective (e.g., authenticating a sender). Designers of security protocols must assume that some of the communicating parties are adversaries. [Ch. 11]
Segment 1. A numbered block of contiguously addressed virtual memory, the block having a range of memory addresses starting with address zero and ending at some specified size. Programs written for a segment-based virtual memory issue addresses that are really two numbers: the first identifies the segment number, and the second identifies the address within that segment. The memory manager must translate the segment number to determine where in real memory the segment is located. The second address may also require translation using a page map. [Ch. 5] 2. In a communication network, the data that the end-to-end layer gives to the network layer for forwarding across the network. A segment is the payload of a packet. Compare with frame (n.), message, and packet. [Ch. 7]
Self-pacing A property of some transmission protocols. A self-pacing protocol automatically adjusts its transmission rate to match the bottleneck data rate of the network over which it is operating. [Ch. 7]
Semaphore A special type of shared variable for sequence coordination among several concurrent threads. A semaphore supports two atomic operations: DOWN and UP. If the semaphore’s value is larger than zero, DOWN decrements the semaphore and returns to its caller; otherwise, DOWN releases its processor until another thread increases the semaphore using UP. When control returns to the thread that originally issued the DOWN operation, that thread retries the DOWN operation. [Ch. 5]
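The DOWN/UP semantics can be sketched with a condition variable (a toy, not the book's implementation); note that DOWN re-tests the value after each wakeup, as the definition requires:

    import threading

    class Semaphore:
        def __init__(self, value=0):
            self.value = value
            self.changed = threading.Condition()

        def down(self):
            with self.changed:
                while self.value == 0:    # retry the test after every wakeup
                    self.changed.wait()
                self.value -= 1

        def up(self):
            with self.changed:
                self.value += 1
                self.changed.notify()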
Sequence coordination A coordination constraint among threads: for correctness, a certain event in one thread must precede some other certain event in another thread. [Ch. 5]
Sequencer A special type of shared variable used for sequence coordination. The primary operation on a sequencer is TICKET, which operates like the “take a number” machine in a bakery or post office: two threads concurrently calling TICKET on the same sequencer receive different values, and the ordering of the values returned corresponds to the time ordering of the execution of TICKET. [Ch. 5]
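A sketch of TICKET and its use in a ticket lock (the guard lock simulates the required atomicity; names are mine):

    import threading

    class Sequencer:
        def __init__(self):
            self._next = 0
            self._guard = threading.Lock()   # makes TICKET atomic in this sketch

        def ticket(self):
            with self._guard:
                value = self._next
                self._next += 1
                return value

    seq = Sequencer()
    now_serving = 0

    def enter():
        my_turn = seq.ticket()               # take a number
        while now_serving != my_turn:        # wait until it is called
            pass

    def leave():
        global now_serving
        now_serving += 1                     # call the next number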
Serial transmission A scheme for increasing the data rate between two modules by sending a series of self-clocking bits over a single transmission line with infrequent or no acknowledgments. [Ch. 7]
Serializable A property of before-or-after actions, that even if several operate concurrently, the result is the same as if they had acted one at a time, in some sequential (in other words, serial) order. [Ch. 9]
Server A module that implements a service. More than one server might implement the same service, or collaborate to implement a fault tolerant version of the service such that even if a server fails, the service is still available. [Ch. 4]
Service A module that responds to actions initiated by clients. [Ch. 4] At the end-to-end layer of a network, the end that responds to actions initiated by the other end. Compare with client. [Ch. 7]
Set up The steps required to allocate storage space for and initialize the state of a connection. [Ch. 7]
Shadow copy A working copy of an object that an all-or-nothing action creates so that it can make several changes to the object while the original remains unmodified. When the all-or-nothing action has made all of the changes, it then carefully exchanges the working copy with the original, thus preserving the appearance that all of the changes occurred atomically. Depending on the implementation, either the original or the working copy may be identified as the “shadow” copy, but the technique is the same in either case. [Ch. 9]
Shared-secret cryptography A key-based cryptographic transformation in which the cryptographic key for transforming can be easily determined from the key for the reverse transformation, and vice versa. In most shared-secret systems, the keys for a transformation and its reverse transformation are identical. [Ch. 11]
Shared-secret key The key used by a shared-secret cryptography system. [Ch. 11]
Sharing Allowing an object to be used by more than one other object without requiring multiple copies of the first object. [Ch. 2]
Sign To generate an authentication tag by transforming a message so that a receiver can use the tag to verify that the message is authentic. The word “sign” is usually restricted to public-key authentication systems. The corresponding description for shared-secret authentication systems is “generate a MAC”. [Ch. 11]
Simple locking A locking protocol for creating before-or-after actions requiring that no data be read or written before reaching the lock point. For the atomic action to also be all-or-nothing, a further requirement is that no locks be released before commit (or abort). Compare with two-phase locking. [Ch. 9]
Simple serialization An atomicity protocol requiring that each newly created atomic action must wait to begin execution until all previously started atomic actions are no longer pending. [Ch. 9]
Simplex Describes a link between two stations that can be used in only one direction. Compare with duplex, half-duplex, and full-duplex. [Ch. 7]
Single-acquire protocol A simple protocol for locking: a thread can acquire a lock only if some other thread has not already acquired it. [Ch. 5]
Single-event upset A synonym for transient fault. [Ch. 8]
Slave In a multiple-site replication scheme, a site that takes update requests from only the master site. Compare with master. [Ch. 10]
Sliding window In flow control, a technique in which the receiver sends an additional window allocation before it has fully consumed the data from the previous allocation, intending that the new allocation arrive at the sender in time to keep data flowing smoothly, taking into account the transit time of the network. [Ch. 7]
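The sizing rule implied by this definition is the bandwidth-delay product: to keep data flowing continuously, the window must cover at least one round-trip time of data at the bottleneck rate. A worked example with made-up numbers:

    # Hypothetical path: 10 ms round-trip time, 100 megabit/s bottleneck.
    round_trip_time = 0.010            # seconds
    bottleneck_rate = 100e6 / 8        # bytes per second

    window = round_trip_time * bottleneck_rate
    print(window)                      # 125000.0 bytes, about 125 kilobytes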
Snoopy cache In a multiprocessor system with a bus and a cache in each processor, a cache design in which the cache actively monitors traffic on the bus to watch for events that invalidate cache entries. [Ch. 10]
Soft modularity Modularity defined by convention but not enforced by physical constraints. Compare with enforced modularity. [Ch. 4]
Soft real-time scheduler A real-time scheduler in which missing a deadline occasionally is acceptable. [Ch. 6]
Soft state State of a running program that the program can easily reconstruct if it becomes necessary to abruptly terminate and restart the program. [Ch. 8]
Source The network attachment point that originated the payload of a packet. Sometimes used as shorthand for source address. [Ch. 7]
Source address An identifier of the source of a packet, usually carried as a field in the header of the packet. [Ch. 7]
Spatial locality A kind of locality of reference in which the reference string contains clusters of references to adjacent or nearby addresses. [Ch. 6]
Speaks for A phrase used to express delegation relationships between principals. “A speaks for B” means that B has delegated some authority to A. [Ch. 11]
Speculation A technique to improve performance by performing an operation in advance of receiving a request on the chance that it will be requested. The hope is that the result can be delivered with less latency and with less setup overhead. Examples include demand paging with larger pages than strictly necessary, prepaging, prefetching, and writing dirty pages before the primary device space is needed. [Ch. 6]
Spin loop A situation in which a thread waits for an event to happen without releasing the processor. [Ch. 5]
Stability A property of an object that, once it has a value, it maintains that value indefinitely. Compare with durability and persistence, terms that have different technical definitions, as explained in Sidebar 2.1. [Ch. 2]
Stable binding A binding that is guaranteed to map a name to the same value for the lifetime of the name space. One of the features of a unique identifier name space. [Ch. 2]
Stack algorithm A class of page-removal algorithms in which the set of pages in a primary device of size m is always a subset of the set of pages in a primary device of size n, if m is smaller than n. Stack algorithms have the property that increasing the size of the memory is guaranteed not to result in increased numbers of missing-page exceptions. [Ch. 6]
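Least-recently-used is the classic stack algorithm. A small check of the inclusion property (my sketch):

    def lru_contents(references, size):
        """Pages held by an LRU-managed memory of the given size."""
        stack = []
        for page in references:
            if page in stack:
                stack.remove(page)
            stack.append(page)             # most recently used at the end
        return set(stack[-size:])

    refs = [1, 2, 3, 1, 4, 2, 5, 1]
    # The smaller memory's pages are always a subset of the larger memory's.
    assert lru_contents(refs, 2) <= lru_contents(refs, 3) <= lru_contents(refs, 4)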
Starvation An undesirable situation in which several threads are competing for a shared resource and because of adverse scheduling one or more of the threads never receives a share of the resource. [Ch. 6]
Static routing A method for setting up forwarding tables in which, once calculated, they do not automatically change in response to changes in network topology and load. Compare with adaptive routing. [Ch. 7]
Static scope An example of an explicit context, used to resolve names of program variables in some programming languages. The name resolver searches for a binding starting with the procedure that used the name, then in the procedure in which the first procedure was defined, and so on. Sometimes called lexical scope. Compare with dynamic scope. [Ch. 2]
Station A device that can send or receive data over a communication link. [Ch. 7]
Stop and wait A synonym for lock step. [Ch. 7]
Storage Another term for memory. Memory devices that are non-volatile and are read and written in large blocks are traditionally called storage devices, but there are enough exceptions that in practice the words “memory” and “storage” should be treated as synonyms. [Ch. 2]
Store and forward A forwarding network organization in which transport-layer messages are buffered in a non-volatile memory such as magnetic disk, with the goal that they never be lost. Many authors use this term for any forwarding network. [Ch. 7]
Stream A sequence of data bits or messages that an application intends to flow between two attachment points of a network. It also usually intends that the data of a stream be delivered in the order in which it was sent, and that there be no duplication or omission of data. [Ch. 7]
Strict consistency An interface requirement that temporary violation of a data invariant during an update never be visible outside of the action doing the update. One feature of the read/write coherence memory model is strict consistency. Sometimes called strong consistency. [Ch. 10]
Stub A procedure that hides from the caller that the callee is not invoked with the ordinary procedure call conventions. The stub may marshal the arguments into a message and send the message to a service, where another stub unmarshals the message and invokes the callee. [Ch. 4]
Supermodule A set of replicated modules interconnected in such a way that it acts like a single module. [Ch. 8]
Supervisor call instruction (SVC) A processor instruction issued by user modules to pass control of the processor to the kernel. [Ch. 5]
Swapping A feature of some virtual memory systems in which a multilevel memory manager removes a complete address space from a primary device and moves in a complete new one. [Ch. 6]
Synonym One of multiple names that map to the same value. Compare with alias, a term that usually, but not always, has the same meaning. [Ch. 2]
System A set of interconnected components that has an expected behavior observed at the interface with its environment. Contrast with environment. [Ch. 1]
Tail drop A strategy for managing an overloaded resource: the system refuses service to the queue entry that arrived most recently. [Ch. 7]
Tear down The steps required to reset the state of a connection and deallocate the space that was used for storage of that state. [Ch. 7]
Temporal locality A kind of locality of reference in which the reference string contains closely spaced references to the same address. [Ch. 6]
Thrashing An undesirable situation in which the primary device is too small to run a thread or a group of threads, leading to frequent missing-page exceptions. [Ch. 6]
Thread An abstraction that encapsulates the state of a running module. This abstraction encapsulates enough of the state of the interpreter that executes the module so that one can stop a thread at any point in time and later resume it. The ability to stop a thread and resume it later allows virtualization of the interpreter. [Ch. 5]
Thread manager A module that implements the thread abstraction. It typically provides calls for creating a thread, destroying it, allowing the thread to yield, and coordinating with other threads. [Ch. 5]
Threat A potential security violation from either a planned attack by an adversary or an unintended mistake by a legitimate user. [Ch. 11]
Throughput A measure of the rate of useful work done by a service for a given workload. [Ch. 6]
Ticket system A security system in which each principal maintains a list of capabilities, one for each object to which the principal is authorized to have access. [Ch. 11]
Tolerated error An error or class of errors that is both detectable and maskable, and for which a systematic recovery procedure has been implemented. Compare with detectable error, maskable error, and untolerated error. [Ch. 8]
Tombstone A piece of data that will probably never be used again but cannot be discarded because there is still a small chance that it will be needed. [Ch. 7]
Trailer Information that a protocol layer adds to the end of a packet. [Ch. 7]
Transaction A multistep action that is both atomic in the face of failure and atomic in the face of concurrency. That is, it is both all-or-nothing and before-or-after. [Ch. 9]
Transactional memory A memory model in which multiple references to primary memory are both all-or-nothing and before-or-after. [Ch. 9]
Transient fault A fault that is temporary and for which retry of the putatively failed component has a high probability of finding that it is okay. Sometimes called a single-event upset. Compare with persistent fault and intermittent fault. [Ch. 8]
Transit time In a forwarding network, the total delay time required for a packet to go from its source to its destination. In other contexts, this kind of delay is sometimes called latency. [Ch. 7]
Transmission delay In a communication network, the component of overall delay contributed by the time spent sending a frame at the available data rate. [Ch. 7]
Transport protocol An end-to-end protocol that moves data between two attachment points of a network while providing a particular set of specified assurances. It can be thought of as a prepackaged set of improvements on the best-effort specification of the network layer. [Ch. 7]
Triple-modular redundancy (TMR) N-modular redundancy with N = 3. [Ch. 8]
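A standard consequence, worth noting next to the definition: if each replica works with probability R and the voter is perfect, the supermodule works when at least two of the three replicas do, so R_TMR = 3R^2 - 2R^3.

    def tmr_reliability(r):
        # All three work, or exactly two of three work.
        return r**3 + 3 * r**2 * (1 - r)   # simplifies to 3r^2 - 2r^3

    print(tmr_reliability(0.9))   # about 0.972: better than one replica
    print(tmr_reliability(0.5))   # 0.5: the break-even point
    print(tmr_reliability(0.3))   # about 0.216: worse than one replica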
Trusted computing base (TCB) That part of a system that must work properly to make the overall system secure. [Ch. 11]
Trusted intermediary A service that acts as the trusted third party on behalf of multiple, perhaps distrustful, clients. It enforces modularity, thereby allowing multiple distrustful clients to share resources in a controlled manner. [Ch. 4]
Two generals dilemma An intrinsic problem that no finite protocol can guarantee to simultaneously coordinate state values at two places that are linked by an unreliable communication network. [Ch. 9]
Two-phase commit A protocol that creates a higher-layer transaction out of separate, lower-layer transactions. The protocol first goes through a preparation (sometimes called voting) phase, at the end of which each lower-layer transaction reports either that it cannot perform its part or that it is prepared to either commit or abort. It then enters a commitment phase in which the higher-layer transaction, acting as a coordinator, makes a final decision—thus the name two-phase. Two-phase commit has no connection with the similar-sounding term two-phase locking. [Ch. 9]
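The shape of the protocol as a sketch (the worker objects and their prepare/commit/abort methods are hypothetical):

    def two_phase_commit(coordinator_log, workers, action):
        # Phase 1: preparation (voting). Each worker either promises it can
        # later commit or abort on command, or it refuses.
        votes = [w.prepare(action) for w in workers]

        # Phase 2: commitment. Logging the decision is the commit point.
        if all(votes):
            coordinator_log.append("COMMIT")
            for w in workers:
                w.commit(action)
        else:
            coordinator_log.append("ABORT")
            for w in workers:
                w.abort(action)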
Two-phase locking A locking protocol for before-or-after atomicity that requires that no locks be released until all locks have been acquired (that is, there must be a lock point). For the atomic action to also be all-or-nothing, a further requirement is that no locks for objects to be written be released until the action commits. Compare with simple locking. Two-phase locking has no connection with the similar-sounding term two-phase commit. [Ch. 9]
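A sketch of the discipline for a two-account transfer (illustrative; the lock table is hypothetical): every lock is acquired before any is released, and locks on written objects are held until commit.

    # 'locks' maps each object name to a threading.Lock().
    def transfer(locks, accounts, a, b, amount):
        # Phase 1: acquire all locks first (the lock point is reached once
        # both are held); a fixed acquisition order avoids deadlock.
        for name in sorted((a, b)):
            locks[name].acquire()
        try:
            accounts[a] -= amount
            accounts[b] += amount
            # ... commit happens here ...
        finally:
            # Phase 2: release only after all acquiring is done (and, for
            # all-or-nothing atomicity, only after commit).
            for name in sorted((a, b)):
                locks[name].release()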
Undo action An application-specified action that, when executed during failure recovery or an abort procedure, reverses the effect of some previously performed, but not yet committed, component action. The goal is that neither the original action nor its reversal be visible above the layer that implements the action. Compare with redo and compensate. [Ch. 9]
Unique identifier name space A name space in which each name, once it is bound to a value, can never be reused for a different value. A unique identifier name space thus provides a stable binding. In a billing system, customer account numbers usually constitute a unique identifier name space. [Ch. 2]
Universal name space A name space of a naming scheme that has only one context. A universal name space has the property that no matter who uses a name it has the same binding. Computer file systems typically provide a universal name space for absolute path names. [Ch. 2]
Universe of values The set of all possible values that can be named by a particular naming scheme. [Ch. 2]
Unlimited name space A name space in which names never have to be reused. [Ch. 3]
Untolerated error An error or class of errors that is undetectable, unmaskable, or unmasked and therefore can be expected to lead to a failure. Compare with detectable error, maskable error, and tolerated error. [Ch. 8]
User-dependent binding A binding for which a name used by a shared object resolves to different values, depending on the identity of the user of the shared object. [Ch. 2]
User mode A feature of a processor that, when set, disallows the use of certain processor features (e.g., changing the page-map address register). Compare with kernel mode. [Ch. 5]
Utilization The percentage of capacity used for a given workload. [Ch. 6]
Value The thing to which a name is bound. A value may be a real, physical object, or it may be another name either from the original name space or from a different name space. [Ch. 2]
Valid construction The term used by software designers for fault avoidance. [Ch. 8]
Version history The set of all values for an object or variable that have ever existed, stored in journal storage. [Ch. 9]
Virtual address An address that must be translated to a physical address before using it to refer to memory. Compare with physical address. [Ch. 5]
Virtual circuit A connection intended to carry a stream through a forwarding network, in some ways simulating an electrical circuit. [Ch. 7]
Virtual machine A method of emulation in which, to maximize performance, a physical processor is used as much as possible to implement virtual instances of itself. [Ch. 5]
Virtual machine monitor The software that implements virtual machines. [Ch. 5]
Virtualization A technique that simulates the interface of a physical object, in some cases creating several virtual objects using one physical instance, in others creating one large virtual object by aggregating several smaller physical instances, and in yet other cases creating a virtual object from a different kind of physical object. [Ch. 5]
Virtual memory manager A memory manager that implements virtual addresses, resolving them to physical addresses by using, for example, a page map. [Ch. 5]
Volatile memory A kind of memory in which the mechanism of retaining information actively consumes energy. When one disconnects the power source, it forgets its information content. Compare with non-volatile memory. [Ch. 2]
Voter A device used in some NMR designs to compare the output of several nominally identical replicas that all have the same input. [Ch. 8]
Well-known name (or address) A name or address that has been advertised so widely that one can depend on it not changing for the lifetime of the value to which it is bound. In the United States, the emergency telephone number “911” is a well-known name. In some file system designs, sector or block number 1 of every storage device is reserved as a place to store device data, making “1” a well-known address in that context. [Ch. 2]
Window In flow control, the quantity of data that the receiving side of a transport protocol is prepared to accept from the sending side. [Ch. 7]
Witness A (usually cryptographically strong) hash value that attests to the content of a file. Another widely used term for this concept is fingerprint. [Ch. 10]
Working directory In a file system, a directory used as a default context, for resolution of relative path names. [Ch. 2]
Working set The set of all addresses to which a thread refers in the interval t. If the application exhibits locality of reference, this set of addresses will be small compared to the maximum number of possible addresses during t. [Ch. 6]
Write-ahead-log (WAL) protocol A recovery protocol that requires appending a log record in journal storage before installing the corresponding data in cell storage. [Ch. 9]
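A minimal sketch of the rule the protocol enforces (file handling and names are mine): the log append reaches durable storage before the cell-storage install.

    import json

    def wal_update(log_file, cell_storage, variable, new_value):
        # 1. Append the log record in journal storage ...
        record = {"var": variable,
                  "old": cell_storage.get(variable),
                  "new": new_value}
        log_file.write(json.dumps(record) + "\n")
        log_file.flush()   # a real system would also fsync for durability

        # 2. ... and only then install the value in cell storage.
        cell_storage[variable] = new_value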
Write tearing See atomic storage.
Write-through A property of a cache: a write operation updates the value in both the primary device and the secondary device before acknowledging completion of the write. (A cache without the write-through property is sometimes called a write-behind cache.) [Ch. 6]
Design principles and hints appear in underlined italics. Procedure names appear in SMALL CAPS. Page numbers in bold face are in the Glossary. Page numbers that are greyed out are in a section that is [on-line].
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y